Wisdom will save you from the ways of wicked men, from men whose words are perverse...
Proverbs 2:12 (NIV)
Some inputs are from untrusted users, so those inputs must be validated (filtered) before being used. You should determine what is legal and reject anything that does not match that definition. Do not do the reverse (identify what is illegal and reject those cases), because you are likely to forget to handle an important case. Limit the maximum character length (and minimum length if appropriate), and be sure not to lose control when such lengths are exceeded (see Chapter 5 for more about this kind of problem, called a buffer overflow).
For strings, identify the legal characters or legal patterns (e.g., as a regular expression) and reject anything not matching that form. There are special problems when strings contain control characters (especially linefeed or NUL) or shell metacharacters; it is often best to ``escape'' such metacharacters immediately when the input is received so that they are not accidentally passed on unescaped. CERT goes further and recommends escaping all characters that aren't in a list of characters not needing escaping [CERT 1998, CMU 1998]. See Section 7.1 for more information on limiting call-outs.
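As an illustration of the ``define what is legal'' approach, here is a minimal Python sketch. The field (a username), the 32-character limit, the character set, and the names validate_username, MAX_USERNAME_LEN, and USERNAME_RE are all assumptions made up for this example; a real program would choose its own policy.

```python
import re

# Hypothetical policy for this example: 1 to 32 characters from a small allowlist.
MAX_USERNAME_LEN = 32
USERNAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def validate_username(value: str) -> str:
    """Return value unchanged if it is legal; raise ValueError otherwise."""
    # Check the length first so oversized input is rejected outright.
    if not 1 <= len(value) <= MAX_USERNAME_LEN:
        raise ValueError("username length out of range")
    # fullmatch() requires the whole string to match the pattern. Using
    # match() with a trailing "$" would accept a value ending in a linefeed.
    if USERNAME_RE.fullmatch(value) is None:
        raise ValueError("username contains illegal characters")
    return value
```

Because the pattern lists only what is allowed, control characters (including linefeed and NUL) and shell metacharacters are rejected without having to be enumerated.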
Limit all numbers to the minimum (often zero) and maximum allowed values. Filenames should be checked; usually you will not want to include ``..'' (higher directory) as a legal value. In filenames it's best to prohibit any change in directory, e.g., by not including ``/'' in the set of legal characters. A full email address checker is actually quite complicated, because there are legacy formats that greatly complicate validation if you need to support all of them; see mailaddr(7) and IETF RFC 822 [RFC 822] for more information if such checking is necessary.
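The number and filename checks might be sketched as follows in Python. The function names parse_bounded_int and validate_filename, the 10-digit and 255-character limits, and the filename character set are assumptions for illustration only; the sketch handles non-negative decimal numbers and single-component filenames.

```python
import re

# Hypothetical policy: a filename is one path component built from these
# characters. "/" is deliberately absent, so no change of directory is possible.
FILENAME_RE = re.compile(r"[A-Za-z0-9_.-]{1,255}")


def parse_bounded_int(text: str, minimum: int, maximum: int) -> int:
    """Parse a non-negative decimal number and enforce minimum <= n <= maximum."""
    # Accept only plain ASCII digits. int() on its own would also accept
    # surrounding whitespace, signs, underscores, and non-ASCII digits.
    if re.fullmatch(r"[0-9]{1,10}", text) is None:
        raise ValueError("not a decimal number")
    value = int(text)
    if not minimum <= value <= maximum:
        raise ValueError("number out of range")
    return value


def validate_filename(name: str) -> str:
    """Return name if it is a legal single-component filename."""
    if FILENAME_RE.fullmatch(name) is None:
        raise ValueError("illegal filename")
    # Reject "." and anything containing "..", which is stricter than
    # strictly necessary once "/" is excluded but easy to reason about.
    if name == "." or ".." in name:
        raise ValueError("directory reference not allowed")
    return name
```

For example, parse_bounded_int("42", 0, 100) returns 42, while validate_filename("../etc/passwd") raises ValueError.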
The legal character patterns must not include characters or character sequences that have special meaning to the program internals or the eventual output unless you account for them. In particular, if you store data (internally or externally) in delimited strings, make sure that the delimiters are not permitted data values. Here are two common cases:
A character sequence may have special meaning to the program's internal storage format. A number of programs store data in comma (,) or colon (:) delimited text files; inserting such values in the input can be a problem unless the program accounts for it. Other characters often causing these problems include single and double quotes (used for surrounding strings) and the less-than sign (used in SGML, XML, and HTML to indicate a tag's beginning). Most data formats have an escape sequence to handle these cases; use it, or filter such data on input.
A character sequence may have special meaning if sent back out to the user. A common case is permitting HTML tags in data input that will later be posted to other readers (e.g., in a guestbook or ``reader comment'' area). These tags can be used by malicious users to attack other users by inserting Java references (including references to hostile applets), DHTML tags, early document endings (via </HTML>), absurd font size requests, and so on, causing anything from unreadable pages to destructive attacks. It's safest to strip or escape all HTML tags; if you must allow some, at least identify a list of ``safe'' HTML commands and only permit those commands. Common safe HTML tags that might be useful for guestbook or other applications supporting short comments include <P> (paragraph), <B> (bold), <I> (italics), <EM> (emphasis), <STRONG> (strong emphasis), <PRE> (preformatted text), <BR> (forced line break), and <A HREF="safe URI"> (hypertext link), as well as all their ending tags. You might even consider supporting the list-oriented tags, such as <OL> (ordered list), <UL> (unordered list), and <LI> (list item). It's tricky to define ``safe URI''; I'd suggest a pattern like ``(http|ftp)://[-A-Za-z0-9._]+'' (this allows ``..'', which is often fine in this application, but note that it intentionally prevents most query formats and other schemes like ``mailto''). There are more HTML tags, but after a certain point you're really permitting full publishing (in which case you need to trust the authors or perform more serious checking than will be described here). You really should check that the HTML commands are properly nested (though supporting an implied </P> where not provided before a <P> would be fine), and if you support list tags further checking is warranted.
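One way to permit only a short list of ``safe'' tags is to escape everything first and then re-enable the exact tag forms you allow. The Python sketch below takes that approach, using the standard html.escape function and the ``safe URI'' pattern suggested above. The name sanitize_comment and the three pattern names are made up for this example, and the sketch does not check nesting, so a real filter would need further checks.

```python
import html
import re

# After escaping, an allowed tag looks like "&lt;b&gt;". Only the bare tag form
# is matched, so tags carrying attributes (e.g., onclick) stay escaped.
SAFE_TAG_RE = re.compile(r"&lt;(/?)(p|b|i|em|strong|pre|br)&gt;", re.IGNORECASE)
# A link is re-enabled only if its whole HREF value matches the safe URI pattern.
SAFE_LINK_RE = re.compile(
    r"&lt;a href=&quot;((?:http|ftp)://[-A-Za-z0-9._]+)&quot;&gt;", re.IGNORECASE
)
END_LINK_RE = re.compile(r"&lt;/a&gt;", re.IGNORECASE)


def sanitize_comment(text: str) -> str:
    """Escape all HTML in text, then restore only the allowlisted tags."""
    # Step 1: escape everything, including quotes and ampersands.
    escaped = html.escape(text, quote=True)
    # Step 2: put back the exact forms on the allowlist, in lowercase.
    escaped = SAFE_TAG_RE.sub(
        lambda m: "<" + m.group(1) + m.group(2).lower() + ">", escaped
    )
    escaped = SAFE_LINK_RE.sub(lambda m: '<a href="' + m.group(1) + '">', escaped)
    return END_LINK_RE.sub("</a>", escaped)
```

With this sketch, a <script> element or a link using any scheme other than http or ftp stays escaped and is displayed as literal text rather than interpreted by the reader's browser.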
These tests should usually be centralized in one place so that the validity tests can be easily examined for correctness later.
Make sure that your validity test is actually correct; this is particularly a problem when checking input that will be used by another program (such as a filename, email address, or URL). Often these tests have subtle errors, producing the so-called ``deputy problem'' (where the checking program makes different assumptions than the program that actually uses the data).
While parsing user input, it's a good idea to temporarily drop all privileges, or even create separate processes (with the parser having permanently dropped privileges, and the other process performing security checks against the parser requests). This is especially true if the parsing task is complex (e.g., if you use a lex-like or yacc-like tool), or if the programming language doesn't protect against buffer overflows (e.g., C and C++). See Section 6.2 for more information on minimizing privileges.
The following subsections discuss different kinds of inputs to a program; note that input includes process state such as environment variables, umask values, and so on. Not all inputs are under the control of an untrusted user, so you need only worry about those inputs that are.