Jun 11, 2004

Translations

Select * Is Evil

You may be in the habit of using SELECT * in your SQL queries. There are several very good reasons to break this habit.

Self-Documentation Lost

If you've ever had the task of diving into someone else's code in the attempt to make bugfixes, changes or even just understand it then you may have run across SQL queries such as:
SELECT *
FROM products
WHERE prod_id = n;
Very nice, legal SQL, sure. But what are we getting back? 1 field? 10? 100? What are the field names and in which order do they appear? Yes, you may have the luck of having a whole db layout right in front of you, but maybe not. Maybe there is some internal documentation, but maybe it's from last August when the db was built and maybe it's changed since then. Maybe there is no documentation except the database itself. Bad practice? Sure, but welcome to the real world.Either you're going to have to dig through the docs or you're going to have to connect to the db and figure out what the table looks like. However, even though you now have the table layout in front of you, you still don't know what assumptions the original coder was making nor which data he intends to actually use until he uses it!.

Maybe there are 8 fields returned and the one of them is misspelled. If the field name is in the query then you can visually match it up; it's right in front of you. Otherwise you consult the docs. Is it _id or id? Maybe the docs say id and the field name in the table is _id.

Consider:

SELECT id
      ,desc
      ,img_url
      ,category
      ,title
      ,serial
      ,color
      ,mystery_field
FROM products
WHERE prod_id = n;
It takes up more vertical space, sure. But the few extra lines are more than worth it. The query is nicely formatted and completely unambiguous. I know exactly what I'm looking for and what I'm getting back if the query succeeds. The query goes from a being black hole to being what all good code can be: self-documenting

Broken Contract

Tables change. Hopefully these changes are planned out and for a good reason. Sometimes they're not. login changes to username to avoid ambiguity. Your manager orders you to change a bunch of fields because he doesn't like leading underscores. _key changes to anything else because whomever thought _key was a good name for a field in the first place was fired 6 months ago.

When a column name changes, obviously anything that depended on the old name has to change as well. You may ask "What's wrong with this?"

SELECT * FROM products;
Lots, but some of the problems might not be obvious. The problem will be afterwards when your code tries to do something with the login field which is now the username field. The query will run absolutely fine, even though the coder's assumption is now wrong. There is no login field anymore. The database doesn't know anything is wrong because you're just saying "give me everything you've got" and it will comply.

What we want to do is have a contract with the database. We want to tell it exactly what we want and it will tell us whether or not it's there. If we issue the query

SELECT nonexistent_field
FROM products;
our database will tell us right away that it's never heard of this field. For example:
ERROR 1054 (42S22): Unknown column 'nonexistant_field' in 'field list'
or
ERROR: column "nonexistant_field" does not exist

We can fix our code and be confident that as long as the query is syntactically valid and as long as the the code accessing the resultset matches the fields in the query then we won't have any problems accessing the data we need. With SELECT * we lose that contract and even though the queries are completing without error we can no longer be confident about our assumptions.

Size Matters

Why ask for everything when you'd be happy with less? Why send out a request for 10 fields, have the database do that extract work retrieving them, sending them back over the wire and processed and then not use them?

The real problem comes when you SELECT * from a table that is storing more than integers and small pieces of character data. Say you have a user table which contains a username, pass and about_me fields. Maybe when you wrote the app only the first 2 fields existed, so you've been logging users into your webapp like so

SELECT *
FROM users
WHERE username = u
AND pass = PASSWORD(p);
As long as you get 1 row back you can be sure the login is valid. Later on a VARCHAR or TEXT field was entered so your users could store a blurb about themselves. Suddenly instead of 2 small text fields, you're pulling hundreds or thousands of bytes you don't need, and that's just for one row.

Now multiply the above situation by n rows. Say you're listing users on your site, 100 at a time. If you pull out that about_me field needlessly and the average length is say, 250 bytes (about the length of this sentence), then you're wasting 25k worth of memory on 2 machines as well as the bandwidth to move it and the extra processing time it takes to handle the data per page view. Say you get a modest 1 page view per second. That's 1.5MB (25k * 60 kb/s) per minute's worth of data filling your RAM and choking your NIC that is a complete and utter waste. This is on top of all the data you're actually using.

For tables of 1 million records, every byte you make the database process needlessly is a waste of nearly 1MB's worth of RAM, bandwidth and processing time on both the server and client machine. Even if you just pull one unneeded 10 byte field you're flushing almost 10MB down the toilet, right off the bat. You might think "Yeah, but I'll never have 1 million records!" That's a) probably not true and b) the wrong attitude. Make it a habit of writing code the proper, efficient, self-documenting way no matter what the scale.

Out Of Order

Get in the habit of specifying every field, every time. Even though your language might not care what order the fields are in, there are those APIs that do. And even if your language doesn't, there are still assumptions that can be made in code that will break if an existing column is ever reordered or deleted or a new column is ever created or an deleted.

When You Assume...

Any time you fetch an array from a resultset, that is, you depend on the columns to be in a certain order... if the columns ever change your code will break and it will be very difficult to track down the problem. For example:

sql := "SELECT * FROM products WHERE id = 5"
rs := db.query(sql)
if db.error
    error(db.errstr)
    exit
end if
# else, success
id, name = rs.nextrow()
print(id . ":" . name . "\n")
rs.free
This example assumes that the first 2 fields in the products table are id and name. Fine. Now, fast forward 6 months. People move on to different projects. Now imagine an upgrade: products are now associated with a category to better organize things:
ALTER TABLE products
ADD FOREIGN KEY cat_id
  REFERENCES categories(id)
  DEFAULT NULL
AFTER id;

The above code is now broken, because of its assumption of the name column's position. (When this is discovered you'll hear "But it worked before!") This is trivially remedied by enumerating the columns in the query.

Don't Do It

In conclusion, there are many reasons not to use SELECT *. Even though SQL is written for a machine it must be clear and concise to other people. So take the extra 20 seconds and enumerate all the fields you really need in your SELECT clauses and your database, application and users will thank you for it; and you might just find that a month from now you have some idea of what your own query is really doing :)

Comments

Ryan Flynn is a programmer and problem solver.