SQL Style for Data Scientists - 3 Simple Rules

Following on last week’s brief introduction, here are some general rules that I try to follow when coding in a SQL environment. These rules are derived from Joe Celko’s SQL Programming Style and also appear in Simon Holywell’s SQL Style Guide. There will always be exceptions, but you might want to keep these rules in mind when you’re developing your own SQL scripts.



Comment Your Code

Commenting your code should be the best part of your job. It is your opportunity to tell others in the clearest terms possible what your code should do. In fact, it is often helpful to write your comments first before you event start to write your query or function. It always helps to know where you’re going before you start your journey.

Generally speaking, you should try to use the C-style /* and */ marks to open and close your comments whenever possible. Doing so will allow you to write multi-line comments without worrying about adding a -- mark to every single line. If you are adding a quick comment on the same line as your code, though, you can use a -- mark and finish with a newline.

/* Update North Korea's country name */
UPDATE Tools.Countries
   SET country_name = 'Little Rocket Land' -- use the unofficial name
 WHERE country_id = 113;

Comment style is a matter of preference. The key is that you do it consistently and in as a consistent a fashion as possible.

Use White Space to Improve Readability

Do not, under any circumstances, write code like you are writing a paragraph. Your code should be broken up into logical parts. Clauses, for example, should appear on separate lines of code.

Below is an example of a simple SELECT query that uses white space and indentation judiciously.

SELECT
       S.state_name   AS state,
       C.country_name AS country,
       C.iso3         AS country_code
  FROM Tools.States AS S
  LEFT OUTER JOIN Tools.Countries AS C
    ON S.country_id = C.country_id
 ORDER BY
       state,
       country;

Use Standard SQL Functions

To maximize the portability of your code, you should do your best to use standard SQL functions whenever possible. Every database management system puts its own spin on SQL. In SQL Server, for example, you can replace NULL values in a query with a value of your choosing with the ISNULL() function. However, that function is not available anywhere outside SQL Server, so if you write a query in SQL Server using the ISNULL() function and then try to run that query on a MySQL database, you are going to get an error. A better choice would be the COALESCE() function, which is available in all database management systems.

SELECT
       country_id,
       iso,
       country_name,
       COALESCE(iso3, 'Not Available')        AS iso3,
       COALESCE(number_code, 'Not Available') AS number_code,
       phone_code
  FROM Tools.Countries
 WHERE iso <> 'US';

You can find a useful summary of standard and dialect-specific functions here. Many of the system-specific functions can be quite useful, so I wouldn’t say their use should be expressly forbidden. If you are writing code that you think might be used on more than one type of database management system, though, you should do your best to avoid them.

If you have any ideas on how to improve these guidelines, please feel free to leave a comment below. I’d love to hear about it.

Written on October 4, 2018