Selecting Data


  • A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
  • A database manager is a program that manipulates information stored in a database.
  • We write queries in a specialized language called SQL to extract information from databases.
  • Use SELECT… FROM… to get values from a database table.
  • SQL is case-insensitive (but data is case-sensitive).

Sorting and Removing Duplicates


  • The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY.
  • The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT.

Filtering


  • Use WHERE to specify conditions that records must meet in order to be included in a query’s results.
  • Use AND, OR, and NOT to combine tests.
  • Filtering is done on whole records, so conditions can use fields that are not actually displayed.
  • Write queries incrementally.

Calculating New Values


  • Queries can do the usual arithmetic operations on values.
  • Use UNION to combine the results of two or more queries.

Missing Data


  • Databases use a special value called NULL to represent missing information.
  • Almost all operations on NULL produce NULL.
  • Queries can test for NULLs using IS NULL and IS NOT NULL.

Aggregation


  • Use aggregation functions to combine multiple values.
  • Aggregation functions ignore null values.
  • Aggregation happens after filtering.
  • Use GROUP BY to combine subsets separately.
  • If no aggregation function is specified for a field, the query may return an arbitrary value for that field.

Combining Data


  • Use JOIN to combine data from two tables.
  • Use table.field notation to refer to fields when doing joins.
  • Every fact should be represented in a database exactly once.
  • A join produces all combinations of records from one table with records from another.
  • A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
  • A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
  • We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
  • The most common join condition is matching keys.

Data Hygiene


  • Every value in a database should be atomic.
  • Every record should have a unique primary key.
  • A database should not contain redundant information.
  • Units and similar metadata should be stored with the data.

Creating and Modifying Data


  • Use CREATE and DROP to create and delete tables.
  • Use INSERT to add data.
  • Use UPDATE to modify existing data.
  • Use DELETE to remove data.
  • It is simpler and safer to modify data when every record has a unique primary key.
  • Do not create dangling references by deleting records that other records refer to.

Programming with Databases - Python


  • General-purpose languages have libraries for accessing databases.
  • To connect to a database, a program must use a library specific to that database manager.
  • These libraries use a connection-and-cursor model.
  • Programs can read query results in batches or all at once.
  • Queries should be written using parameter substitution, not string formatting.

Programming with Databases - R


  • Data analysis languages have libraries for accessing databases.
  • To connect to a database, a program must use a library specific to that database manager.
  • R’s libraries can be used to directly query or read from a database.
  • Programs can read query results in batches or all at once.
  • Queries should be written using parameter substitution, not string formatting.
  • R has multiple helper functions to make working with databases easier.