Conventions and Guidelines for Perl Source Code
Maintainer: Dave Mitchell <davem@fdgroup.com>
Class: Internals
PDD Number: 7
Version: 1
Status: Developing
Last Modified: 6 August 2001
PDD Format: 1
Language: English
Based on an earlier draft which covered only code comments.
None. First version.
This document describes the various rules, guidelines and advice for those wishing to contribute to the source code of Perl, in such areas as code structure, naming conventions, comments etc.
One of the criticisms of Perl 5 is that its source code is impenetrable to newcomers, due to such things as inconsistent or obscure variable naming conventions, lack of comments in the source code, and so on. Hence this document.
We define three classes of conventions. Those that say must are mandatory, and code will not be accepted (apart from in exceptional circumstances) unless it follows these rules. Those that say should are strong guidelines that should normally be followed unless there is a sensible reason to do otherwise. Finally, where it says may, this is a tentative suggestion to be used at your discretion.
Note that this particular PDD makes some recommendations that are specific to the C programming language. This does not preclude Perl being implemented in other languages, but in this case, additional PDDs may need to be authored for the extra language-specific features.
The following must apply:
To ensure that tabs aren't inadvertently used for indentation, the following boilerplate code must appear at the bottom of each source file. (This rule may be rescinded if I'm ever threatened with a lynching....)
/*
* Local variables:
* c-indentation-style: bsd
* c-basic-offset: 4
* indent-tabs-mode: nil
* End:
*
* vim: expandtab shiftwidth=4:
*/
} should
line up with the opening if etc.
} else {
//): some C compilers may choke on them
return (x+y)*2, but no space between function name and following paren,
eg z = foo(x+y)*2
The following should apply
return foo; rather than return (foo);
if (!foo) ... rather than if (foo == FALSE) ... etc.
if (a && (b = c)) ...
#ifndef NO_FEATURE_FOO
x = (a + b) * f(c, d / e)
line(s) and
with one extra indent:
do_arbitrary_function(
list_of_parameters_with_long_names, or_complex_subexpression(
of_more_params, or_expressions + 1
)
);
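As a hedged illustration only, the following sketch pulls several of the layout rules above into one small function: 4-space indents with no tabs, braces lined up with the opening keyword, an uncuddled else, C-style comments only, return without parentheses, and a test written as if (!foo). The function sum_clamped and its parameters are invented for this example and are not part of the Perl source.

```c
#include <stddef.h>

/* sum_clamped - hypothetical helper, named only for this sketch.
 * Adds up the first len entries of values, counting any entry
 * larger than limit as limit.
 */
static int
sum_clamped(const int *values, size_t len, int limit)
{
    int total = 0;
    size_t i;

    if (!values) {
        return 0;
    }
    for (i = 0; i < len; i++) {
        if (values[i] > limit) {
            total += limit;
        }
        else {
            total += values[i];
        }
    }
    return total;
}
```

Calling sum_clamped on the array {1, 5, 10} with a limit of 4 adds 1 + 4 + 4.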
To enforce the spacing, indenting, and bracing guidelines mentioned above, the following arguments to GNU Indent should be used:
-kr -nce -sc -cp0 -l79 -lc79 -psl -nut -cdw -ncs -lps
This expands out to:
-nbad   Do not force blank lines after declarations.
-bap    Force blank lines after procedure bodies.
-bbo    Prefer to break long lines before boolean operators.
-nbc    Do not force newlines after commas in declarations.
-br     Put braces on line with if, etc.
-brs    Put braces on struct declaration line.
-c33    Put comments to the right of code in column 33 (not recommended).
-cd33   Put declaration comments to the right of code in column 33.
-ncdb   Do not put comment delimiters on blank lines.
-nce    Do not cuddle } and else.
-cdw    Do cuddle do { } while.
-ci4    Continuation indent of 4 spaces.
-cli0   Case label indent of 0 spaces.
-ncs    Do not put a space after a cast operator.
-d0     Set indentation of comments not to the right of code to 0 spaces.
-di1    Put declaration variables 1 space after their types.
-nfc1   Do not format comments in the first column as normal.
-nfca   Do not format any comments.
-hnl    Prefer to break long lines at the position of newlines in the input.
-i4     4-space indents.
-ip0    Indent parameter types in old-style function definitions by 0 spaces.
-l79    Maximum line length for non-comment lines is 79 columns.
-lc79   Maximum line length for comment lines is 79 columns.
-lp     Line up continued lines at parentheses.
-npcs   Do not put a space after the function in function calls.
-nprs   Do not put a space after every '(' and before every ')'.
-saf    Put a space after each for.
-sai    Put a space after each if.
-saw    Put a space after each while.
-sc     Put the '*' character at the left of comments.
-nsob   Do not swallow optional blank lines.
-nss    Do not force a space before the semicolon after certain statements.
-nut    Use spaces instead of tabs.
-lps    Leave space between '#' and preprocessor directive.
-psl Put the type of a procedure on the line before its name. (.c files)
or
-npsl Leave a procedure declaration's return type alone (.h files)
Please note that it is also necessary to include all typedef types with the ``-T'' option to ensure that everything is formatted properly.
A script (run_indent.pl) is provided which runs indent with the proper arguments automatically.
The characters making up filenames must be chosen from the ASCII set A-Z,a-z,0-9 plus .-_
An underscore should be used to separate words rather than a hyphen (-). A file should not normally have more than a single '.' in it, and this should be used to denote a suffix of some description. The filename must still be unique if the main part is truncated to 8 characters and any suffix truncated to 3 characters. Ideally, filenames should be restricted to 8.3 in the first place, but this is not essential.
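The truncation rule can be checked mechanically. The sketch below is one possible way to do so, not a tool that ships with Perl: truncate_8_3 and clash_8_3 are hypothetical names, the split is assumed to happen at the last '.', and the comparison is case-sensitive (a stricter check might also fold case, since 8.3 filesystems are often case-insensitive).

```c
#include <string.h>

/* Write the 8.3-truncated form of name into out, which must hold
 * at least 13 bytes (8 + '.' + 3 + NUL). The last '.' is assumed
 * to introduce the suffix.
 */
static void
truncate_8_3(const char *name, char *out)
{
    const char *dot = strrchr(name, '.');
    size_t main_len = dot ? (size_t)(dot - name) : strlen(name);

    if (main_len > 8) {
        main_len = 8;
    }
    memcpy(out, name, main_len);
    out[main_len] = '\0';
    if (dot) {
        /* append the '.' plus at most 3 suffix characters */
        strncat(out, dot, 4);
    }
}

/* Return 1 if two different names become identical once truncated. */
static int
clash_8_3(const char *a, const char *b)
{
    char ta[13];
    char tb[13];

    truncate_8_3(a, ta);
    truncate_8_3(b, tb);
    return strcmp(a, b) != 0 && strcmp(ta, tb) == 0;
}
```

With these definitions, pmc_private.h and pmc_privileged.h clash (both truncate to pmc_priv.h), whereas foo_globals.h and foo_private.h do not.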
Each subsystem foo should supply the following files. This arrangement is based on the assumption that each subsystem will - as far as is practical - present an opaque interface to all other subsystems within the core, as well as to extensions and embeddings.
The top-level structure of the Perl source tarball should be as follows:
/README, etc a few top-level documents
/docs/ Assorted miscellaneous documentation
/docs/pdds/ The current PDDs
/perl/ The source code for Perl itself
/perl/os/foo/ OS-specific source code for operating system foo
/foo/ The source code for other families of binaries (eg /x2p/)
/hints/ per-OS build hints files
/scripts/ scripts needed during the building process
/t/ scripts used by make test
/lib/ perl modules ready for installation
/ext/ perl modules that need compiling
/pod/ src of the Perl man pages etc
plus others as it becomes necessary.
/* file header comments */
#if !defined(PARROT_<FILENAME>_H_GUARD)
#define PARROT_<FILENAME>_H_GUARD
/* body of file */
#endif /* PARROT_<FILENAME>_H_GUARD */
new_foo_bar rather than
NewFooBar or (gasp) newfoobar.
create_foo_from_bar() in preference to ct_foo_bar(). Avoid cryptic
abbreviations wherever possible.
pmc_foo(), struct io_bar. They should be further prefixed
with the word 'perl' if they have external visibility or linkage,
namely, non-static functions, plus macros and typedefs etc which appear
in public header files. (Global variables are handled specially; see below.)
For example:
perlpmc_foo()
struct perlio_bar
typedef struct perlio_bar Perlio_bar
#define PERLPMC_readonly_TEST ...
In the specific case of the use of global variables and functions
within a subsystem, convenience macros will be defined (in
foo_private.h) that allow use of the shortened name in the case of
functions (ie pmc_foo() instead of perlpmc_foo()), and hide the
real representation in the case of global variables.
pmc_foo.
Foo_bar. The exception to this is when the first component is a
short abbreviation, in which case the whole first component may be made
uppercase for readability purposes, eg IO_foo rather than
Io_foo. Structures should generally be typedefed.
PMC_foo_FLAG, PMC_bar_FLAG, ....
_FLAG, eg
PMC_readonly_FLAG (although you probably want to use an enum
instead.)
_TEST, eg
if (PMC_readonly_TEST(foo)) ...
_SET, eg
PMC_readonly_SET(foo);
_CLEAR, eg
PMC_readonly_CLEAR(foo);
_MASK,
eg foo &= ~PMC_STATUS_MASK (but see notes on extensibility below).
_SETALL, _CLEARALL, _TESTALL or _TESTANY suffixes
as appropriate, to indicate aggregate bits, eg
PMC_valid_CLEARALL(foo)
HAS_, eg HAS_BROKEN_FLOCK, HAS_EBCDIC.
IN_, eg PERL_IN_CORE, PERL_IN_PMC, PERL_IN_X2P. Individual
include file visitations should be marked with PERL_IN_FOO_H for
file foo.h
USE_, eg PERL_USE_STDIO, USE_MULTIPLICITY.
DECL_, eg DECL_SAVE_STACK. Note
that macros which implicitly declare and then use variables are strongly
discouraged, unless it is essential for portability or extensibility.
The following are in decreasing preference style-wise, but increasing
preference extensibility-wise.
{ Stack sp = GETSTACK; x = POPSTACK(sp) ... /* sp is an auto variable */
{ DECL_STACK(sp); x = POPSTACK(sp); ... /* sp may or may not be auto */
{ DECL_STACK; x = POPSTACK; ... /* anybody's guess */
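To tie together the flag-related suffixes listed above (_FLAG, _TEST, _SET, _CLEAR and _MASK), here is a minimal sketch of how such macros might be defined. Everything in it is an assumption made for illustration: struct pmc_sketch, the tainted flag, the bit values and the choice of passing a pointer are not taken from the Perl source.

```c
/* Hypothetical flag bits; an enum is used as the text suggests. */
enum {
    PMC_readonly_FLAG = 1 << 0,
    PMC_tainted_FLAG  = 1 << 1    /* invented second flag */
};

/* All status bits together, for use as foo &= ~PMC_STATUS_MASK */
#define PMC_STATUS_MASK (PMC_readonly_FLAG | PMC_tainted_FLAG)

/* Stand-in for the real structure, which is not shown here. */
struct pmc_sketch {
    unsigned int flags;
};

#define PMC_readonly_TEST(p)  (((p)->flags & PMC_readonly_FLAG) != 0)
#define PMC_readonly_SET(p)   ((p)->flags |= PMC_readonly_FLAG)
#define PMC_readonly_CLEAR(p) \
    ((p)->flags &= ~(unsigned int)PMC_readonly_FLAG)
```

Callers then write if (PMC_readonly_TEST(foo)) ... and never touch the flags field directly, which is what makes the later extensibility advice workable.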
All global variables needed for the internal use of a particular subsystem should all be declared within a single struct called foo_globals for subsystem foo. This structure's declaration is placed in the file foo_globals.h. Then somewhere a single compound structure will be declared which has as members the individual structures from each subsystem. Instances of this structure are then defined as a one-off global variable, or as per-thread instances, or whatever is required.
[Actually, three separate structures may be required, for global, per-interpreter and per-thread variables.]
Within an individual subsystem, macros are defined for each global variable of the form GLOBAL_foo (the name being deliberately clunky). So we might for example have the following macros:
/* perl_core.h or similar */
#ifdef HAS_THREADS
# define GLOBALS_BASE (aTHX_->globals)
#else
# define GLOBALS_BASE (Perl_globals)
#endif
/* pmc_private.h */
#define GLOBAL_foo GLOBALS_BASE.pmc.foo
#define GLOBAL_bar GLOBALS_BASE.pmc.bar
... etc ...
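For the non-threaded case, the pieces described above might fit together roughly as follows. This is a sketch under assumptions rather than the actual layout: the member types, the io subsystem, the struct name perl_globals and the helper bump_foo are all invented for the example.

```c
/* pmc_globals.h (hypothetical contents) */
struct pmc_globals {
    int foo;
    int bar;
};

/* io_globals.h (hypothetical second subsystem) */
struct io_globals {
    int baz;
};

/* The single compound structure, one member per subsystem. */
struct perl_globals {
    struct pmc_globals pmc;
    struct io_globals  io;
};

/* One-off global instance, as used when HAS_THREADS is not defined. */
struct perl_globals Perl_globals;

#define GLOBALS_BASE (Perl_globals)

/* pmc_private.h */
#define GLOBAL_foo GLOBALS_BASE.pmc.foo
#define GLOBAL_bar GLOBALS_BASE.pmc.bar

/* Code inside the pmc subsystem only ever uses the GLOBAL_ names. */
static int
bump_foo(void)
{
    GLOBAL_foo += 1;
    return GLOBAL_foo;
}
```

Because subsystem code refers only to GLOBAL_foo, switching GLOBALS_BASE to a per-thread or per-interpreter structure should not require touching bump_foo.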
The importance of good code documentation cannot be stressed enough. To make your code understandable by others (and indeed by yourself when you come to make changes a year later :-), the following conventions apply to all source files.
Currently no particular format or structure is imposed on the developer file, but it should have as a minimum the following sections:
/* pp_hot.c - like pp.c, this file contains functions that operate
* on the contents of the stack (pp == 'push & pop'), but in this
* case, frequently used ('hot') functions have been moved here
* from pp.c to (hopefully) improve CPU cache hit rates.
*/
/* This section deals with 'arenas', which are chunks of PMCs of
* a particular type that are allocated in one go. Individual
* requests can then be made to grab or release individual PMCs.
* For each type foo, there is a pointer called GLOBAL_arena_foo
* which blah blah....
*/
Often the comment need only be a single line explaining its purpose,
but sometimes more explanation may be needed. For example, ``return an
Integer Foo to its allocation pool'' may be enough to demystify the
function del_I_foo()
Each comment should be of the form
/*=for api apiname entityname[,entityname..] flags ....(TBC)....
comments....
*/
where apiname is the API the entity belongs to, eg pmc, and entityname is the actual name of the function or macro or whatever. Where there is a whole family of entities that have the same properties and can be collectively described with a single comment, a list of entity names can be provided.
TBC ...
/* The loop is partially unrolled here as it makes it a lot faster.
* See the .dev file for the full details
*/
if (FOO_bar_BAZ(**p+*q) <= (r-s[FOZ & FAZ_MASK]) || FLOP_2(z99)) {
/* we're in foo mode: clean up lexicals */
... (20 lines of gibberish) ...
}
else if (...) {
/* we're in bar mode: clean up globals */
... (20 more lines of gibberish) ...
}
else {
/* we're in baz mode: self-destruct */
....
}
If Perl 5 is anything to go by, the lifetime of Perl 6 will be at least seven years. During this period, the source code will undergo many major changes never envisaged by its original authors - cf threads and Unicode in Perl 5. To this end, your code should balance out the assumptions that make things possible, fast or small, with the assumptions that make it difficult to change things in future. This is especially important for parts of the code which are exposed through APIs - the requirements of source or binary compatibility for such things as extensions can make it very hard to change things later on.
For example, if you define suitable macros to set/test flags in a struct, then you can later add a second word of flags to the struct without breaking source compatibility. (Although you might still break binary compatibility if you're not careful.) Of the following two methods of setting a common combination of flags, the second doesn't assume that all the flags are contained within a single field:
foo->flags |= (FOO_int_FLAG | FOO_num_FLAG | FOO_str_FLAG);
FOO_valid_value_SETALL(foo);
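To make the point concrete, the sketch below shows what the macro form could look like after a second flag word has been added. It is an illustration under assumptions: struct foo_sketch, the field name flags2, the bit values and the decision to move FOO_str_FLAG into the second word are all invented here.

```c
/* Stand-in structure after a second word of flags was added. */
struct foo_sketch {
    unsigned int flags;     /* original flag word */
    unsigned int flags2;    /* second word, added later */
};

enum {
    FOO_int_FLAG = 1 << 0,  /* stored in flags */
    FOO_num_FLAG = 1 << 1   /* stored in flags */
};
enum {
    FOO_str_FLAG = 1 << 0   /* assumed to have moved to flags2 */
};

/* Callers keep writing FOO_valid_value_SETALL(foo); only the
 * macro body knows that the bits now live in two fields.
 */
#define FOO_valid_value_SETALL(f) \
    ((f)->flags  |= (FOO_int_FLAG | FOO_num_FLAG), \
     (f)->flags2 |= FOO_str_FLAG)

#define FOO_valid_value_TESTALL(f) \
    ((((f)->flags & (FOO_int_FLAG | FOO_num_FLAG)) \
          == (FOO_int_FLAG | FOO_num_FLAG)) \
     && (((f)->flags2 & FOO_str_FLAG) != 0))
```

Code that used the first form, OR-ing all three flags into foo->flags, would silently set the wrong bit after such a change, while code using the macro only needs recompiling.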
Similarly, avoid using a char* (or {char*,length}) if it is feasible to later use a PMC* at the same point: cf UTF-8 hash keys in Perl 5.
Of course, private code hidden behind an API can play more fast and loose than code which gets exposed.
Related to extensibility is portability. Perl runs on many, many platforms, and will no doubt be ported to ever more bizarre and obscure ones over time. You should never assume an operating system, processor architecture, endian-ness, word size, or whatever. In particular, don't fall into any of the following common traps:
Internal data types and their utility functions (especially for strings) should be used over a bare char * whenever possible. Ideally there should be no char * in the source anywhere, and no use of C's standard string library.
Don't assume GNU C, and don't use any GNU extensions unless protected by #ifdefs for non-GNU-C builds.
TBC ... Any contributions welcome !!!
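Two of the points above can be sketched briefly. The first part guards a GNU extension (GCC's __builtin_expect) behind a test of __GNUC__ and supplies a plain fallback; the second probes byte order at run time instead of assuming it. The names SKETCH_LIKELY, safe_div and is_little_endian are invented for this example.

```c
/* Use the GNU branch hint only when building with a GNU-compatible
 * compiler; other compilers get an equivalent plain expression.
 */
#if defined(__GNUC__)
#  define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#else
#  define SKETCH_LIKELY(x) (!!(x))
#endif

/* Hypothetical caller: divides, returning 0 for a zero divisor. */
static int
safe_div(int num, int den)
{
    if (SKETCH_LIKELY(den != 0)) {
        return num / den;
    }
    return 0;
}

/* Return 1 if the least significant byte of an unsigned int is
 * stored first, 0 otherwise. Nothing is assumed at compile time.
 */
static int
is_little_endian(void)
{
    unsigned int probe = 1;

    return *(unsigned char *)&probe == 1;
}
```

Either branch of the #if gives safe_div the same behaviour; the GNU path merely passes a hint to the optimiser.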
We want Perl to be fast. Very fast. But we also want it to be portable and extensible. Based on the 90/10 principle, (or 80/20, or 95/5, depending on who you speak to), most performance is gained or lost in a few small but critical areas of code. Concentrate your optimisation efforts there.
Note that the most overwhelmingly important factor in performance is in choosing the correct algorithms and data structures in the first place. Any subsequent tweaking of code is secondary to this. Also, any tweaking that is done should as far as possible be platform independent, or at least likely to cause speed-ups in a wide variety of environments, and do no harm elsewhere. Only in exceptional circumstances should assembly ever even be considered, and then only if generic fallback code is made available that can still be used by all other non-optimised platforms.
Probably the dominant factor (circa 2001) that affects processor performance is the cache. Processor clock rates have increased far in excess of main memory access rates, and the only way for the processor to proceed without stalling is for most of the data items it needs to be found to hand in the cache. It is reckoned that even a 2% cache miss rate can cause a slowdown in the region of 50%. It is for this reason that algorithms and data structures must be designed to be 'cache-friendly'.
A typical cache may have a block size of anywhere between 4 and 256
bytes. When a program attempts to read a word from memory and the word
is already in the cache, then processing continues unaffected.
Otherwise, the processor is typically stalled while a whole contiguous
chunk of main memory is read in and stored in a cache block. Thus,
after incurring the initial time penalty, you then get all the memory
adjacent to the initially read data item for free. Algorithms that make
use of this fact can experience quite dramatic speedups. For example,
the following pathological code ran four times faster on my machine by
simply swapping i and j.
int a[1000][1000];
... (a gets populated) ...
int i,j,k = 0;
for (i=0; i<1000; i++) {
for (j=0; j<1000; j++) {
k += a[j][i];
}
}
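For anyone wishing to repeat the experiment, a self-contained version might look like the sketch below. The helper names (fill, sum_row_order, sum_column_order, time_call) and the fill pattern are invented for this example, and the timings it produces will vary with machine, compiler and optimisation level, so no particular ratio should be expected.

```c
#include <time.h>

#define DIM 1000

static int a[DIM][DIM];

/* Populate the array with small, predictable values. */
static void
fill(void)
{
    int i, j;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
            a[i][j] = (i + j) & 7;
        }
    }
}

/* Inner loop walks along a row: consecutive addresses. */
static long
sum_row_order(void)
{
    long k = 0;
    int i, j;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
            k += a[i][j];
        }
    }
    return k;
}

/* Inner loop walks down a column: a stride of DIM ints per step. */
static long
sum_column_order(void)
{
    long k = 0;
    int i, j;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
            k += a[j][i];
        }
    }
    return k;
}

/* Run fn once, store its result and return the CPU seconds used. */
static double
time_call(long (*fn)(void), long *result)
{
    clock_t start = clock();

    *result = fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

After calling fill(), pass sum_row_order and sum_column_order to time_call in turn and compare the two durations; both functions return the same total, so any difference comes from the access pattern alone.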
This all boils down to: keep things near to each other that get accessed at around the same time. (This is why the important optimisations occur in data structure and algorithm design rather than in the detail of the code.) This rule applies both to the layout of different objects relative to each other, and to the relative positioning of individual fields within a single structure.
If you do put an optimisation in, time it on as many architectures as you can, and be suspicious of it if it slows down on any of them! Perhaps it will be slow on other architectures too (current and future). Perhaps it wasn't so clever after all? If the optimisation is platform specific, you should probably put it in a platform-specific function in a platform-specific file, rather than cluttering the main source with zillions of #ifdefs.
And remember to document it.
Loosely speaking, Perl tends to optimise for speed rather than space, so you may want to code for speed first, then tweak to reclaim some space while not affecting performance.
The section on coding style is based on Perl5's Porting/patching.pod by Daniel Grisinger. The section on naming conventions grew from some suggestions by Paolo Molaro <lupus@lettere.unipd.it>. Other snippets came from various P5Pers. The rest of it is probably my fault.