American Science Institute of Technology  

 


 

 

 

Data Typing, Declarations, Variables, and other Objects

Most languages' built-in data types are abstractions of the underlying machine organization, and the language rarely defines the types in terms of exact machine representations. For example, an integer variable may be a 16-bit two's complement value on one machine, a 32-bit value on another, or even a 64-bit value. Clearly, a program written to expect 32- or 64-bit integers will malfunction on a machine (or compiler) that only supports 16-bit integers. The reverse can also be true.

One supposed advantage of a high level language is that it abstracts away the machine dependencies that exist in data types. In theory, an integer is an integer is an integer ... In practice, there are short integers, integers, and long integers. Common sizes include eight, sixteen, thirty-two, and even sixty-four bits, with more on the way. Unfortunately, the abstraction the high level language provides can destroy the ability to port a program from one machine to another.

Most modern high level languages provide programmers with the ability to define new data types as isomorphisms (synonyms) of existing types. Using this facility, it is possible to define a data type module that provides precise definitions for most data types. For example, you could define the int16 and int32 data types that always use 16 or 32 bits, respectively. By doing so, you make it much easier to port your programs between most systems (and their compilers): you simply change the definitions of the int16 and int32 types on the new machine. Consider the following C/C++ example:

On a 16-bit machine:

typedef int int16;
typedef long int32;

On a 32-bit machine:

typedef short int16;
typedef int int32;
Rule:
If a built-in type has different semantics on different architectures or in different compilers, always use a set of type definitions that let you easily adjust the program to a different architecture. It is dangerous to assume a particular object uses a specific data format (e.g., two's complement binary or IEEE floating point). It is even worse to assume an object has a fixed number of bits. You should avoid using predefined types in a language.
Guideline:
If the data type you are creating depends upon a specific format, use names like int8, int16, int32, int64, real32, real64, and real80 (that is, a type name with the number of bits appended) to denote your types. If the data type does not depend on a specific representation, use a descriptive name (see the next section on naming conventions). Try to avoid the use of types in a language that vary depending on the underlying machine representation (alas, this is not always possible).
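
As a minimal sketch of such a data type module in C++, the definitions can live in one header and be checked at compile time. This assumes a C++11 or later compiler that supplies the optional exact-width types in <cstdint>; the file name is hypothetical, and the sizes of float and double are assumptions about the target rather than guarantees of the language.

```cpp
// datatypes.h (hypothetical file name): the one place to adjust when porting.
#include <climits>   // CHAR_BIT
#include <cstdint>   // std::int8_t ... std::int64_t (optional exact-width types)

using int8   = std::int8_t;
using int16  = std::int16_t;
using int32  = std::int32_t;
using int64  = std::int64_t;
using real32 = float;    // assumption: float is a 32-bit format on this target
using real64 = double;   // assumption: double is a 64-bit format on this target

// Fail the build, rather than misbehave at run time, if an assumption is wrong.
static_assert(sizeof(int16)  * CHAR_BIT == 16, "int16 must be 16 bits");
static_assert(sizeof(int32)  * CHAR_BIT == 32, "int32 must be 32 bits");
static_assert(sizeof(real32) * CHAR_BIT == 32, "real32 must be 32 bits");
static_assert(sizeof(real64) * CHAR_BIT == 64, "real64 must be 64 bits");
```

On a machine where one of the assertions fails, only this module needs to change; the rest of the program keeps using int16, int32, and so on.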

Don't redefine existing types. This may seem like a contradiction to the guideline above, but it really isn't. This statement says that if you have an existing type that uses the name "integer" you should not create a new type named "integer." Doing so would only create confusion. Another programmer, reading your code, may confuse your type with the old "integer" type every time s/he sees a variable of type integer. This applies to existing user types as well as predefined types.

Enforced Rule:
Never redefine an existing type.

Declare all variables, even if the language processor allows implicit declarations. At one time there was a controversy as to whether it was better to have implicitly declared variables or force the user to explicitly declare all variables (e.g., the FORTRAN vs. ALGOL/Pascal crowd). A widely repeated story, whose details are disputed, holds that NASA and JPL lost a Venus probe because of an implicitly declared variable (that just happened to have the wrong type); either way, the "explicitly declare" crowd won the argument. Fortunately, most modern languages require explicit declarations.

Enforced Rule:
Always explicitly declare all variables (and other identifiers) unless the language does not allow this.

Some languages force you to declare all your variables at a given point in a program unit (e.g., Pascal); some languages are more flexible and let you declare variables anywhere in your program as long as you declare them before their first use; other languages do not require that you declare variables at all (see the above rule). Since it is possible to declare symbols at different points in a program, different programmers have developed different conventions concerning the position of their declarations. The two most popular conventions are the following:

  • Declare all symbols at the beginning of the associated program unit (function, procedure, etc.).
  • Declare all variables as close as possible to their use.

Logically, the second scheme above would seem to be the best. However, it has one major drawback - although names typically have only a single definition, the program may use them in several different locations. So although you can easily define a variable just prior to its first use, other uses may be hundreds of lines away. The advantage of declaring variables at the beginning of the program unit is that, no matter how far away it is, the programmer always knows where to look to find the variable declarations. If you embed the definition in the middle of the code nearest the first usage, someone reading the program may have to resort to a "linear search" in order to find the declaration.

Rule:
All variable, constant, and type definitions should occur at the very beginning of the program unit whose limits define the scope of the object.

Unfortunately, not all name definitions are passive; some actually execute code. An instance of a class object in C++ is a good example. The definition of a class object calls the constructor for that class. The constructor may require the computation of some parameter values prior to the object's definition. This would prevent the placement of the definition at the beginning of the module. The solution is rather simple and well within the definition of a "Rule" within this guide:

Rule:
If you cannot define an object at the beginning of the program unit to which it belongs, then put a place-holder comment at the beginning of the block and define the variable as soon as possible within the program unit. You should place a comment near such a definition to remind the reader to update the comment at the beginning of the block if the actual definition ever changes.
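
The rule above might look like the following in C++. This is only a sketch: the class LineReport_t and the function CountReportLines are hypothetical names invented for illustration, and the wording of the place-holder comment is one possible convention, not a prescribed format.

```cpp
#include <cassert>
#include <string>

// Hypothetical class: its constructor needs a value computed at run time.
class LineReport_t {
public:
    explicit LineReport_t(int LineCount) : StoredLineCount(LineCount)
    {
        assert(LineCount >= 0);   // constraint: a count is never negative
    }
    int Lines() const { return StoredLineCount; }
private:
    int StoredLineCount;
};

int CountReportLines(const std::string& Text)
{
    // Data dictionary:
    //   Index           Position of the character being examined in Text.
    //   LineCount       Number of newline characters seen so far in Text.
    //   FinishedReport  (LineReport_t) PLACE-HOLDER: defined below, once
    //                   LineCount is final, because the constructor needs it.
    std::string::size_type Index;
    int LineCount = 0;

    for (Index = 0; Index < Text.size(); ++Index) {
        if (Text[Index] == '\n') {
            ++LineCount;
        }
    }

    // Definition of FinishedReport. If this definition changes, update the
    // place-holder comment at the top of this function.
    LineReport_t FinishedReport(LineCount);
    return FinishedReport.Lines();
}
```

A reader looking for FinishedReport still finds it listed at the top of the function, and the comment there says why the definition itself appears later.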

Some might argue that certain languages, like C++, provide excellent facilities for declaring otherwise anonymous variables with certain language constructs. For example, the "for ( int i = 0; i < 10; ++i) ..." statement limits the scope of "i" to this for loop. However, the goal of these guidelines is to produce a standard that applies to all languages; making special exceptions for C++ (or some feature-laden language) will only lead to confusion. Besides, C++ lets you create new program units by using "{" and "}" (e.g., the compound statement). Those who absolutely desire to put their definitions as close to the for-loop as possible can always do something like the following:

        // Previous statements in this code...
                .
                .
                .
        {
                int i;
                for (i=start; i <= end; ++i) ... 
        }
                .
                .
                .
        // Additional statements in this code.

Descriptive comments should always accompany a set of variable declarations. These comments should describe the purpose of the variables, provide complete English names for the variables if the names use any abbreviations (see the next section), and describe any constraints or assumptions on the use of these variables. The position of these comments should be immediately before the block or program unit that declares the variables (e.g., in the block of comments preceding a function definition). To improve readability and make it easy for a programmer to locate a particular name while manually scanning through a listing, you should place only one variable declaration per line so the reader can easily find the variable's name while scanning the left-hand side of the list. In languages where the type name precedes the variable name, it's a good idea to put the type name on one line and the variable name (indented) on the next line.

 
Rule:
Associated with any set of variable declarations will be a set of comments known as the "Data Dictionary." This data dictionary will describe the name and purpose for each variable. The Data Dictionary will also describe any constraints or assumptions on the use of the variables.
Guideline:
Variable declarations should appear on separate lines. If desired, the type specification should appear on a separate line as well. Variable and type names should be aligned in columns and easy to find and read.

Examples:

        (* Pascal *)
        var
                LineCnt,                { Number of lines, words, and   } 
                WordCnt,                { characters in a file.         }
                CharCnt:integer;

        (* Also Reasonable *)

        var
                LineCnt:integer;        { Number of lines, words, and   } 
                WordCnt:integer;        { characters in a file.         }
                CharCnt:integer;


        /* C/C++  */

        int
                LineCnt,                /* Number of lines, words, and  */
                WordCnt,                /* characters in a file.        */
                CharCnt;

        /* Another C/C++ Version */

        int     LineCnt;        /* Number of lines, words, and          */
        int     WordCnt;        /* characters in a file.                */
        int     CharCnt;

 

Names

According to studies done at IBM, the use of high-quality identifiers in a program contributes more to the readability of that program than any other single factor, including high-quality comments. The quality of your identifiers can make or break your program; programs with high-quality identifiers can be very easy to read, while programs with poor-quality identifiers will be very difficult to read. There are very few "tricks" to developing high-quality names; most of the rules are nothing more than plain old-fashioned common sense. Unfortunately, programmers (especially C/C++ programmers) have developed many arcane naming conventions that ignore common sense. The biggest obstacle most programmers have to learning how to create good names is an unwillingness to abandon existing conventions. Yet their only defense when quizzed on why they adhere to (existing) bad conventions seems to be "because that's the way I've always done it and that's the way everybody else does it."

Naming conventions represent one area in Computer Science where there are far too many divergent views (program layout is the other principal area). The primary purpose of an object's name in a programming language is to describe the use and/or contents of that object. A secondary consideration may be to describe the type of the object. Programmers use different mechanisms to handle these objectives. Unfortunately, there are far too many "conventions" in place; it would be asking too much to expect any one programmer to follow several different standards. Therefore, this standard will apply across all languages as much as possible.

The vast majority of programmers know only one language - English. Some programmers know English as a second language and may not be familiar with a common non-English phrase that is not in their own language (e.g., rendezvous). Since English is the common language of most programmers, all identifiers should use easily recognizable English words and phrases.

Rule:
All identifiers that represent words or phrases must be English words or phrases.

Alphabetic Case Considerations

A case-neutral identifier will work properly whether you compile it with a compiler that has case sensitive identifiers or case insensitive identifiers. In practice, this means that all uses of the identifier must be spelled exactly the same way (including case) and that no other identifier exists whose only difference is the case of the letters in the identifier. For example, if you declare an identifier "ProfitsThisYear" in Pascal (a case-insensitive language), you could legally refer to this variable as "profitsThisYear" and "PROFITSTHISYEAR". However, this is not a case-neutral usage since a case sensitive language would treat these three identifiers as different names. Conversely, in case-sensitive languages like C/C++, it is possible to create two different identifiers with names like "PROFITS" and "profits" in the program. This is not case-neutral since attempting to use these two identifiers in a case insensitive language (like Pascal) would produce an error since the case-insensitive language would think they were the same name.

Enforced Rule:
All identifiers must be "case-neutral."

Different programmers (especially in different languages) use alphabetic case to denote different objects. For example, a common C/C++ coding convention is to use all upper case to denote a constant, macro, or type definition and to use all lower case to denote variable names or reserved words. Prolog programmers use an initial upper case alphabetic to denote a variable. Other comparable coding conventions exist. Unfortunately, there are so many different conventions that make use of alphabetic case that they are nearly worthless, hence the following rule:

Rule:
You should never use alphabetic case to denote the type, classification, or any other program-related attribute of an identifier (unless the language's syntax specifically requires this).

There are going to be some obvious exceptions to the above rule; this document will cover those exceptions a little later. Alphabetic case does have one very useful purpose in identifiers - it is useful for separating words in a multi-word identifier; more on that subject in a moment.

To produce readable identifiers often requires a multi-word phrase. Natural languages typically use spaces to separate words; we cannot, however, use this technique in identifiers. Unfortunatelywritingmultiwordidentifiersmakesthemalmostimpossibletoreadifyoudonotdosomethingtodistinguishtheindividualwords (Unfortunately writing multiword identifiers makes them almost impossible to read if you do not do something to distinguish the individual words). There are a couple of good conventions in place to solve this problem. This standard's convention is to capitalize the first alphabetic character of each word in the middle of an identifier.

Rule:
Capitalize the first letter of interior words in all multi-word identifiers.
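
A short C++ sketch of the rule above; the constant, its value, and the function are hypothetical and exist only to show the capitalization.

```cpp
// Without separation the words run together:
//     maxitemsperpage, pagesneededfor
// With interior capitals each word is visible at a glance.
constexpr int MaxItemsPerPage = 25;   // hypothetical page capacity

// Number of pages required to hold ItemCount items (ItemCount >= 0).
constexpr int PagesNeededFor(int ItemCount)
{
    return (ItemCount + MaxItemsPerPage - 1) / MaxItemsPerPage;
}
```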

Note that the rule above does not specify whether the first letter of an identifier is upper or lower case. Subject to the other rules governing case, you can elect to use upper or lower case for the first symbol, although you should be consistent throughout your program.

Lower case characters are easier to read than upper case. Identifiers written completely in upper case take almost twice as long to recognize and, therefore, impair the readability of a program. Yes, all upper case does make an identifier stand out. Such emphasis is rarely necessary in real programs. Yes, common C/C++ coding conventions dictate the use of all upper case identifiers. Forget them. They not only make your programs harder to read, they also violate the first rule above.

Rule:
Avoid using all upper case characters in an identifier.

Abbreviations

The primary purpose of an identifier is to describe the use of, or value associated with, that identifier. The best way to create an identifier for an object is to describe that object in English and then create a variable name from that description. Variable names should be meaningful, concise, and non-ambiguous to an average programmer fluent in the English language. Avoid short names. Some research has shown that programs using identifiers whose average length is 10-20 characters are generally easier to debug than programs with substantially shorter or longer identifiers.

Avoid abbreviations as much as possible. What may seem like a perfectly reasonable abbreviation to you may totally confound someone else. Consider the following variable names that have actually appeared in commercial software:

NoEmployees, NoAccounts, pend

The "NoEmployees" and "NoAccounts" variables seem to be boolean variables indicating the presence or absence of employees and accounts. In fact, this particular programmer was using the (perfectly reasonable in the real world) abbreviation of "number" to indicate the number of employees and the number of accounts. The "pend" name referred to a procedure's end rather than any pending operation.

Programmers often use abbreviations in two situations: they're poor typists and they want to reduce the typing effort, or a good descriptive name for an object is simply too long. The former case is an unacceptable reason for using abbreviations. The second case, especially if care is taken, may warrant the occasional use of an abbreviation.

Guideline:
Avoid all identifier abbreviations in your programs. When necessary, use standardized abbreviations or ask someone to review your abbreviations. Whenever you use abbreviations in your programs, create a "data dictionary" in the comments near the names' definition that provides a full name and description for your abbreviation.
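
As an illustration of the guideline above, here is a minimal C++ sketch in which the abbreviation "Num" is kept but documented. The function and its parameters are hypothetical; the point is the data dictionary comment sitting next to the definition.

```cpp
// Data dictionary (abbreviations used in this unit):
//   NumAccounts    Number of open customer accounts; never negative.
//   NumEmployees   Number of employees on the payroll; must be > 0.
// Returns the average number of accounts handled per employee.
constexpr double AccountsPerEmployee(int NumAccounts, int NumEmployees)
{
    return static_cast<double>(NumAccounts) / NumEmployees;
}
```

With the full names spelled out in the comment, a reader has no reason to mistake "Num" for "No" in the sense of "none".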

The variable names you create should be pronounceable. "NumFiles" is a much better identifier than "NmFls". The first can be spoken, the second you must generally spell out. Avoid homonyms and long names that are identical except for a few syllables. If you choose good names for your identifiers, you should be able to read a program listing over the telephone to a peer without overly confusing that person.

Rule:
All identifiers should be pronounceable (in English) without having to spell out more than one letter.

 

The Position of Components Within an Identifier

When scanning through a listing, most programmers only read the first few characters of an identifier. It is important, therefore, to place the most important information (that defines and makes this identifier unique) in the first few characters of the identifier. So, you should avoid creating several identifiers that all begin with the same phrase or sequence of characters since this will force the programmer to mentally process additional characters in the identifier while reading the listing. Since this slows the reader down, it makes the program harder to read.

Guideline:
Try to make most identifiers unique in the first few character positions of the identifier. This makes the program easier to read.
Corollary:
Never use a numeric suffix to differentiate two names.

Many C/C++ programmers, especially Microsoft Windows programmers, have adopted a formal naming convention known as "Hungarian Notation." To quote Steve McConnell from Code Complete: "The term 'Hungarian' refers both to the fact that names that follow the convention look like words in a foreign language and to the fact that the creator of the convention, Charles Simonyi, is originally from Hungary." One of the first rules given concerning identifiers stated that all identifiers are to be English names. Do we really want to create "artificially foreign" identifiers? Hungarian notation actually violates another rule as well: names using the Hungarian notation generally have very common prefixes, thus making them harder to read.

Hungarian notation does have a few minor advantages, but the disadvantages far outweigh the advantages. The following list from Code Complete and other sources describes what's wrong with Hungarian notation:

  • Hungarian notation generally defines objects in terms of basic machine types rather than in terms of abstract data types.
  • Hungarian notation combines meaning with representation. One of the primary purposes of high level language is to abstract representation away. For example, if you declare a variable to be of type integer, you shouldn't have to change the variable's name just because you changed its type to real.
  • Hungarian notation encourages lazy, uninformative variable names. Indeed, it is common to find variable names in Windows programs that contain only type prefix characters, without a descriptive name attached.
  • Hungarian notation prefixes the descriptive name with some type information, thus making it harder for the programmer to find the descriptive portion of the name.
Guideline:
Avoid using Hungarian notation and any other formal naming convention that attaches low-level type information to the identifier.
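
The contrast can be sketched in C++ as follows. The Hungarian-style names in the comment are typical of the convention but invented for this example, as is the function.

```cpp
// Hungarian style, shown only for contrast (prefixes encode machine types):
//     long lAvail;   int nNeeded;   bool fFits;
// Changing lAvail to a wider type would make its "l" prefix misleading.

// Descriptive alternative: the names state the meaning and the
// declarations state the type.
constexpr bool RequestFits(long long BytesAvailable, long long BytesNeeded)
{
    return BytesNeeded <= BytesAvailable;
}
```

If BytesAvailable later needs a different integer type, only its declaration changes; the name remains accurate.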

Although attaching machine type information to an identifier is generally a bad idea, a well thought-out name can successfully associate some high-level type information with the identifier, especially if the name implies the type or the type information appears as a suffix. For example, names like "PencilCount" and "BytesAvailable" suggest integer values. Likewise, names like "IsReady" and "Busy" indicate boolean values. "KeyCode" and "MiddleInitial" suggest character variables. A name like "StopWatchTime" probably indicates a real value. Likewise, "CustomerName" is probably a string variable. Unfortunately, it isn't always possible to choose a great name that describes both the content and type of an object; this is particularly true when the object is an instance (or definition of) some abstract data type. In such instances, some additional text can improve the identifier. Hungarian notation is a raw attempt at this that, unfortunately, fails for a variety of reasons.

A better solution is to use a suffix phrase to denote the type or class of an identifier. A common UNIX/C convention, for example, is to apply a "_t" suffix to denote a type name (e.g., size_t, key_t, etc.). This convention succeeds over Hungarian notation for several reasons including (1) the "type phrase" is a suffix and doesn't interfere with reading the name, (2) this particular convention specifies the class of the object (const, var, type, function, etc.) rather than a low level type, and (3) it certainly makes sense to change the identifier if its classification changes.

Guideline:
If you want to differentiate identifiers that are constants, type definitions, and variable names, use the suffixes "_c", "_t", and "_v", respectively.
Rule:
The classification suffix should not be the only component that differentiates two identifiers.
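
A brief C++ sketch of the suffix convention; every name here is hypothetical. Note that no two identifiers differ only in their suffix.

```cpp
using Celsius_t = double;                    // type definition: "_t"
constexpr Celsius_t FreezingPoint_c = 0.0;   // constant: "_c"
Celsius_t LatestReading_v = 21.5;            // variable: "_v"

// True when the given temperature sample lies below the freezing point.
constexpr bool IsBelowFreezing(Celsius_t Sample_v)
{
    return Sample_v < FreezingPoint_c;
}
```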

Can we apply this suffix idea to variables and avoid the pitfalls? Sometimes. Consider a high level data type "button" corresponding to a button on a Visual BASIC or Delphi form. A variable name like "CancelButton" makes perfect sense. Likewise, labels appearing on a form could use names like "ETWWLabel" and "EditPageLabel". Note that these suffixes still suffer from the fact that a change in type will require that you change the variable's name. However, changes in high level types are far less common than changes in low-level types, so this shouldn't present a big problem.

Names to Avoid

Avoid using symbols in an identifier that are easily mistaken for other symbols. This includes the sets {"1" (one), "I" (upper case "I"), and "l" (lower case "L")}, {"0" (zero) and "O" (upper case "O")}, {"2" (two) and "Z" (upper case "Z")}, {"5" (five) and "S" (upper case "S")}, and {"6" (six) and "G" (upper case "G")}.

Guideline:
Avoid using symbols in identifiers that are easily mistaken for other symbols (see the list above).

Avoid misleading abbreviations and names. For example, FALSE shouldn't be an identifier that stands for "Failed As a Legitimate Software Engineer." Likewise, you shouldn't compute the amount of free memory available to a program and stuff it into the variable "Profits".

Rule:
Avoid misleading abbreviations and names.

You should avoid names with similar meanings. For example, if you have two variables "InputLine" and "InputLn" that you use for two separate purposes, you will undoubtedly confuse the two when writing or reading the code. If you can swap the names of the two objects and the program still makes sense, you should rename those identifiers. Note that the names do not have to be similar, only their meanings. "InputLine" and "LineBuffer" are obviously different but you can still easily confuse them in a program.

Rule:
Do not use names with similar meanings for different objects in your programs.

In a similar vein, you should avoid using two or more variables that have different meanings but similar names. For example, if you are writing a teacher's grading program you probably wouldn't want to use the name "NumStudents" to indicate the number of students in the class along with the variable "StudentNum" to hold an individual student's ID number. "NumStudents" and "StudentNum" are too similar.

Rule:
Do not use similar names that have different meanings.

Avoid names that sound similar when read aloud, especially out of context. This would include names like "hard" and "heart", "Knew" and "new", etc. Remember the discussion in the section above on abbreviations: you should be able to discuss your program listing over the telephone with a peer. Names that sound alike make such discussions difficult.

Guideline:
Avoid homonyms in identifiers.

Avoid misspelled words in names and avoid names that are commonly misspelled. Most programmers are notoriously bad spellers (look at some of the comments in our own code!). Spelling words correctly is hard enough; remembering how to spell an identifier incorrectly is even more difficult. Likewise, if a word is often spelled incorrectly, requiring a programmer to spell it correctly on each use is probably asking too much.

Guideline:
Avoid misspelled words and names that are often misspelled in identifiers.

If you redefine the name of some library routine in your code, another programmer will surely confuse your name with the library's version. This is especially true when dealing with standard library routines and APIs.

Enforced Rule:
Do not reuse existing standard library routine names in your program unless you are specifically replacing that routine with one that has similar semantics (i.e., don't reuse the name for a different purpose).

 

 

Copyright © 1997 - 2006 American Science Institute of Technology