Databases and SQL
A relational database stores information in tables, each of which
has a fixed set of columns and a variable number of records.
A database manager is a program that manipulates information stored
in a database.
We write queries in a specialized language called SQL to extract
information from databases.
Use SELECT… FROM… to get values from a database table.
SQL is case-insensitive (but data is case-sensitive).
The records in a database table are not intrinsically ordered: if we
want to display them in some order, we must specify that explicitly with
ORDER BY.
The values in a database are not guaranteed to be unique: if we want
to eliminate duplicates, we must specify that explicitly as well using
DISTINCT.
Use WHERE to specify conditions that records must meet in order to
be included in a query’s results.
Use AND, OR, and NOT to combine tests.
Filtering is done on whole records, so conditions can use fields
that are not actually displayed.
Write queries incrementally.
Queries can do the usual arithmetic operations on values.
Use UNION to combine the results of two or more queries.
Databases use a special value called NULL to represent missing
information.
Almost all operations on NULL produce NULL.
Queries can test for NULLs using IS NULL and IS NOT NULL.
Use aggregation functions to combine multiple values.
Aggregation functions ignore null
values.
Aggregation happens after filtering.
Use GROUP BY to combine subsets separately.
If no aggregation function is specified for a field, the query may
return an arbitrary value for that field.
Use JOIN to combine data from two tables.
Use table.field notation to refer to fields when doing joins.
Every fact should be represented in a database exactly once.
A join produces all combinations of records from one table with
records from another.
A primary key is a field (or set of fields) whose values uniquely
identify the records in a table.
A foreign key is a field (or set of fields) in one table whose
values are a primary key in another table.
We can eliminate meaningless combinations of records by matching
primary keys and foreign keys between tables.
The most common join condition is matching keys.
Every value in a database should be atomic.
Every record should have a unique primary key.
A database should not contain redundant information.
Units and similar metadata should be stored with the data.
Use CREATE and DROP to create and delete tables.
Use INSERT to add data.
Use UPDATE to modify existing data.
Use DELETE to remove data.
It is simpler and safer to modify data when every record has a
unique primary key.
Do not create dangling references by deleting records that other
records refer to.
General-purpose languages have libraries for accessing
databases.
To connect to a database, a program must use a library specific to
that database manager.
These libraries use a connection-and-cursor model.
Programs can read query results in batches or all at once.
Queries should be written using parameter substitution, not string
formatting.
Data analysis languages have libraries for accessing databases.
To connect to a database, a program must use a library specific to
that database manager.
R’s libraries can be used to directly query or read from a
database.
Programs can read query results in batches or all at once.
Queries should be written using parameter substitution, not string
formatting.
R has multiple helper functions to make working with databases
easier.