This is a quick introduction to joins - mostly for people who are new to the subject.

What is a join?

Fundamentally, a join is a cross product of the rows (records) from all participating tables. That is, all possible combinations of all rows from the tables.

Example:

Say you have two tables, T1 (2 rows) and T2 (3 rows):

T1 T2
'A'

1

'B'

2

3

then a join between T1 and T2 will produce the following result table, T3:

T3  
'A'

1

'A'

2

'A'

3

'B'

1

'B'

2

'B'

3

Though we use only two input tables here, there is no formal limit to the number tables you can join simultaneously.

Declaring joins

In SQL, joins can be specified in two different ways:

  • implicitly, or
  • explicitly with the JOIN keyword.

Even if you've never used JOIN in a query, there's a good chance you have been using joins implicitly. The join whose result we saw previously can be expressed like this:

SELECT * FROM T1, T2  -- implicit join

or like this:

SELECT * FROM T1 CROSS JOIN T2 -- explit join

The two forms are exactly equivalent.

In general, for cross joins, you end up with a result table with RowCount(T1) * RowCount(T2).. * RowCount(Tn) rows and ColumnCount(T1) + ColumnCount(T2).. + ColumnCount(Tn) columns.  In our case, that's 2 * 3 = 6 rows with 1 + 1 = 2 columns.

It's clear from this that results of cross joins can become huge very quickly. Fortunately, most joins work on only a small subset of the complete cross product.

The formal definition for an explicit cross joins is:

<Table> CROSS JOIN <Table>

Combining related information

Usually, joins are not used to build complete cross products but rather to combine related information from various tables.

Lets assume we have our two tables, T1 and T2, again, but this time they also have a common column, Code, whose values correspond between the two tables:

T1   T2  
Code Value Code Value
'Z01'

'A'

'Z01'

1

'Z02'

'B'

'Z02'

2

    'Z02'

3

We can now correlate the data using an implicit join like this:

SELECT * FROM T1, T2 WHERE T1.Code = T2.Code;

The result:

Code T2.Code Value Value_2
'Z01'

'Z01'

1

'A'

'Z02'

'Z02'

2

'B'

'Z02'

'Z02'

3

'B'

The same logical join operation can be expressed explicitly like this:

SELECT * FROM T1 JOIN T2 ON T1.Code = T2.Code;

Except for the column headings and order, the result is equivalent:

Code Value T2.Code T2.Value
'Z01'

'A'

'Z01'

1

'Z02'

'B'

'Z02'

2

'Z02'

'B'

'Z02'

3

In general, the ON clause on the JOIN looks and works just like a WHERE clause on a SELECT, except that an ON clause must specify one or more relations between the two tables being joined. Also, many SQL engines support only the equality operator for joining, but some support others, like >= or <=.

The formal definition for a join with an ON clause is:

<Table> JOIN <Table> ON <conditional expression>,

where <conditional expression>  is usually a form similar to

<Table1.ColumnX = Table2.ColumnY [ AND Table1.ColumnZ = Table2.ColumnW]....>

For joins where the names of the columns being joined on are the same between the participating tables, there's a more compact way to express the same join:

SELECT * FROM T1 JOIN T2 USING(Code);

The result of a JOIN with a USING clause differs slightly from one with an ON clause by listing the columns used for joining only once each:

Code Value T2.Value
'Z01'

'A'

1

'Z02'

'B'

2

'Z02'

'B'

3

The formal definition for a join with a USING clause is:

<Table> JOIN <Table> USING(<column list>);

There's also an ultra-short form, the NATURAL JOIN:

SELECT * FROM T1 NATURAL JOIN T2;

NATURAL JOIN is the same as JOIN USING with the list of columns in the USING clause containing ALL columns that have their names in common between the participating tables. In our case, the join above corresponds to

SELECT * FROM T1 JOIN T2 USING(Code, Value);

However, since - in our case - no values in the Value columns are the same between the two tables, the result is an empty set.

The three previous join types discussed (NATURAL, JOIN ON, JOIN USING) are referred to collectively as inner joins. You are allowed to express that literally specifying the INNER keyword before the JOIN, e.g.:

NATURAL INNER JOIN
INNER JOIN ON...
INNER JOIN USING...

but in all three cases, the INNER keyword is optional.

So far, we have not seen any explicit joins (using the JOIN keyword) that could not have been expressed equally well using an implicit join (SELECT with a list of source tables). Indeed, it is only a matter of user preference which form is used - the two forms have the same performance characteristics on most SQL implementation.

Outer joins

The difference between inner and outer joins has to do with NULL values. If no NULL values are involved in a join*, then an outer join is equivalent to its inner form. By 'involved' in the previous sentence is meant compared among tables to determine whether or not a row should appear in the result.

Note that the NULLs need not be explicit: An empty table expression has a value of NULL, as we shall see shortly.

Generally in SQL, comparing a value of NULL to anything else (including another NULL) gives an undefined result. So for an expression like

WHERE T1.Code = T2.Code

if either T1.Code or T2.Code (or both) is NULL, then the result of the expression will become False*. For a join, this means that there will be no rows in the result table for source rows where any of the columns participating in the join expression are null. This is what you'd intuitively expect, so that's fine for most situations. Sometimes,

however, it is nice to get a row in the result that represents missing information in the source tables.

Or rather Not True (i.e. Unknown). Also note that - although not relevant right here - NULL <> NULL is Not True as well.

As an example, if you have a customer table and an orders table, you can join the two giving a result with all customer and order information combined:

Customers  
CustNo Name
1

IBM

2

Oracle

3

Informix

Orders    
OrderNo CustNo OrderAmount
1

1

100,000

2

1

250,000

3

3

10,000

4

4

5,000

SELECT * FROM Customers JOIN Orders USING(CustNo);

CustNo Name OrderNo OrderAmount
1

IBM

1

100,000

1

IBM

2

250,000

3

Informix

3

10,000

But what if we would like all customers to appear - even the ones with no orders? That's where outer joins come in. An outer join is like an inner join except we get to specify that rows from either the left, the right, or both tables in the join with no matching record in the 'opposite' table should appear in the result.

Example:

SELECT * FROM Customers LEFT OUTER JOIN Orders USING(CustNo);

CustNo  

Name    OrderNo OrderAmount
1

IBM

1

100,000

1

IBM

2

250,000

2

Oracle

<null>

<null>

3

Informix

3

10,000

Here, by specifying LEFT, we've said that rows from the left-hand table in the join (Customers) with no corresponding row in the right-hand table (Orders) - when considering the join operator (CustNo) - should appear in the result. Since there is no data from Orders to combine the Customer data from such rows with, the columns for Orders all become <null>.

Similarly, if we want to list all customer/order combinations, plus all orders with no matching customer, we can specify a RIGHT join:

SELECT * FROM Customers RIGHT OUTER JOIN Orders USING(CustNo);

CustNo Name OrderNo OrderAmount
1

IBM

1

100,000

1

IBM

2

250,000

3

Informix

3

10,000

<null>

<null>

4

5,000

Finally, we can ask for a combined LEFT and RIGHT join known as a FULL join:

SELECT * FROM Customers FULL OUTER JOIN Orders USING(CustNo);

CustNo Name OrderNo OrderAmount
1

IBM

1

100,000

1

IBM

2

250,000

2

Oracle

<null>

<null>

3

Informix

3

10,000

<null>

<null>

4

5,000

Note that In all three cases, since LEFT, RIGHT, and FULL all imply outer joins, the OUTER keyword is actually obsolete and can be omitted.

Note also that I've specified a USING clause in all examples because I happen to use a matching column name between the tables, but an ON clause or a NATURAL clause would work the same in principle.

The formal definitions for an outer joins is

<Table1> [NATURAL] [LEFT | RIGHT | FULL] [OUTER] JOIN <Table2> ....

When and why use explicit joins?

Joins can often be executed more efficient than their non-join equivalent. Consider the following example:

Say we want to list the names of all customers with orders. A typical non-join solution might look like this:

SELECT Name FROM Customers WHERE
EXISTS (SELECT * FROM Orders WHERE Orders.CustNo = Customers.CustNo);

This query uses what is known as a correlated subquery. What that means is that the subquery (the part in the parenthesis) refers to something from the enclosing expression (Customers.CustNo). Query engines will typically process this query by iterating over the rows in the Customers table, executing the subquery for each one. Since there is a certain fixed overhead involved in executing any query (including a sub-query), the following solution - without the subquery - will almost always be considerably faster:

SELECT DISTINCT Name FROM Customers JOIN Orders USING(CustNo);

Home | Site Contents | Documentation | FAQ, Tips & Tricks | NexusDB FAQ | SQL