Reading Perl Code

Code Smells

G. Wade Johnson

Campbell Johnson

Code Reading

Code is written to be:

Why Read Code?


Good Code versus Bad Code

You can learn from both good code and bad code.


What is a Code Smell?

A code smell is a symptom of a potential problem in source code. A code smell differs from a bug in two major ways.

  1. The code may actually work, but it is suboptimal.
  2. Sometimes a code smell refers to a piece of code that is ugly because the problem it is solving is ugly.

The code may be suboptimal from a design or architecture viewpoint. It could also violate minor house style rules, or not match the style of the surrounding code.

Sometimes ugly code is the result of solving an ugly problem. This could result in smelly code that is actually correct for the situation.

General Code Smells

The Portland Pattern Repository has a good list of Code Smells that covers general object-oriented programming.


Long Method/Subroutine

Long subroutines are hard to test and maintain. The definition of long is context dependent.


Large Class/Package

Large packages are harder to test, understand, and maintain. Large packages usually violate the Single Responsibility Principle.


Duplicate Code

Duplicated code usually begins to diverge over time. This results in code that is subtly inconsistent and bugs that seem to come back after being fixed.

This is why we have subroutines and modules.

Dead Code

If code is never called, it should be removed.

Remember: code that does not exist has no bugs.

Divergent Change

One class/module is commonly changed in different ways for different reasons.

This is a symptom of a Single Responsibility Principle violation.

Shotgun Surgery

Changes influence many classes/modules and methods.

This is normally the result of having some functionality or knowledge that is not the responsibility of one piece of code.


Data Clumps

Data that is (almost) always used together and would make a great class.

This smell specifically refers to separate variables or structures that are used together.

This does not refer to the case where a series of related pieces of data are already grouped in a single hash or other structure.

Long Parameter list

This is a real maintenance problem because later readers of the code will find it hard to remember which parameters are which.

Subroutines with too many parameters are usually violating the Single Responsibility Principle.

Feature Envy

Feature Envy describes a subroutine that seems to focus more on manipulating another package than on dealing with the package it is in.

This is usually a sign that the subroutine needs to move.

Inappropriate Intimacy

This smell describes a class or function that has too much intimate knowledge about another class/package.

Some packages (or subroutines) use package variables to change the behavior of the package, which forces Inappropriate Intimacy.

It makes the code much more difficult to deal with, due to Action at a Distance issues.

Several of the procedural CPAN modules do this. For example, Data::Dumper has a few package variables that control formatting.
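For example, a minimal sketch (the data here is hypothetical) of how those package variables reach across the module boundary:


use Data::Dumper;

# The caller reaches into Data::Dumper's package variables to control
# formatting; without local(), the change would affect every other use
# of Dumper() in the program.
local $Data::Dumper::Sortkeys = 1;   # sort hash keys in the output
local $Data::Dumper::Indent   = 1;   # more compact indentation

my %config = ( name => 'example', retries => 3 );
print Dumper( \%config );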

One place where this can work well is Dependency Injection. This should really only be used to facilitate testing.

Magic Literals

Raw literals in the code are almost never a good idea. Make a named constant instead.

Most people are pretty good about doing this with magic numbers, but they forget that raw strings are also magic literals.
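A minimal sketch, with hypothetical names, of trading magic literals for named constants:


use constant MAX_RETRIES   => 3;
use constant ACTIVE_STATUS => 'active';

sub can_retry {
    my ( $attempts, $status ) = @_;

    # The magic-literal version would be:
    #   return $attempts < 3 && $status eq 'active';
    return $attempts < MAX_RETRIES && $status eq ACTIVE_STATUS;
}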


Complex Conditional

The more complex the conditional in an if, unless, while, etc., the harder the code is to read and maintain.

This leads to a number of opportunities for misunderstanding or fixes that break code.

Instead of getting a relatively good understanding at a glance, any reader of the code is going to need to parse the conditional, reason through the logic, try to determine the edge cases handled, and (in a way) simulate the conditional in their head to understand it.
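A minimal sketch, with hypothetical data and rules, of pulling a dense conditional into a named predicate:


my $order    = { total => 120, has_discount => 0 };
my $customer = { years => 3, is_vip => 0 };

# Hard to scan at a glance:
if (   $order->{total} > 100
    && ( $customer->{years} >= 2 || $customer->{is_vip} )
    && !$order->{has_discount} )
{
    print "free shipping\n";
}

# Easier to read: the condition now says what it means.
sub qualifies_for_free_shipping {
    my ( $order, $customer ) = @_;
    return $order->{total} > 100
        && ( $customer->{years} >= 2 || $customer->{is_vip} )
        && !$order->{has_discount};
}

print "free shipping\n" if qualifies_for_free_shipping( $order, $customer );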

Vague Identifier

Variable, subroutine/method, and package names should be specific and informative.

If you can't come up with a good name, that suggests that the function is not as well defined or understood as it should be.

Inconsistent Names

The code should use the same term for the same concept throughout the code.

Any time the same term is used for different items or two different terms are used for the same item, a reader or maintainer of the code is likely to be confused.

Backpedaling

Backpedaling is where you execute a piece of code and then go through various indicators, data stores, and state queries to figure out what happened in the function.

It is an indicator that information is not being communicated correctly.

When code does something, it knows it's doing it. It can communicate that to a function to-be-named-later, which will know with certainty exactly what it's supposed to be doing. A concrete example is a subroutine that uses globals or package variables to return extra data separate from the normal return value.
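A minimal sketch (the names are hypothetical) of the package-variable form of this smell, and a return value that carries the information directly:


our $last_error = '';    # extra result smuggled out through a package variable

sub save_record_smelly {
    my ($record) = @_;
    $last_error = defined $record->{id} ? '' : 'missing id';
    return $last_error ? 0 : 1;
}

# The caller has to backpedal through $last_error to learn what happened.
# Returning the status and the reason together avoids the detour:
sub save_record {
    my ($record) = @_;
    return ( 0, 'missing id' ) unless defined $record->{id};
    return ( 1, '' );
}

my ( $ok, $why ) = save_record( { id => 42 } );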

Primitive Obsession

When we use primitive data types to represent domain ideas.

For example, we use a string to represent an exception or an integer to represent an amount of money.

In many cases, there should be constraints (and possibly behavior) associated with the actual idea that are not available with the primitive type. A value object would encapsulate the meaning in a usable name.
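A minimal sketch of a value object (the Money package here is hypothetical) in place of a bare integer:


package Money;

sub new {
    my ( $class, $cents ) = @_;
    die "Amount must be a whole number of cents\n"
        unless $cents =~ /\A-?\d+\z/;
    return bless { cents => $cents }, $class;
}

sub cents     { return $_[0]->{cents} }
sub as_string { return sprintf '%.2f', $_[0]->{cents} / 100 }

package main;

my $price = Money->new(1999);
print $price->as_string(), "\n";    # prints 19.99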

Code Junk

Code Junk refers to extraneous, unnecessary syntax added to make sure the code is doing what we intend.

Syntactic noise is often a maintenance issue, because other developers may waste time trying to figure out why anyone used this strange construct.

This can be caused by oddities in a language's design, but more often it's caused by fundamental misunderstandings about the language on the part of the programmer.

Code Junk is language specific. Some syntax is optional in one language but required in another.
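A minimal, hypothetical sketch of the kind of noise this refers to in Perl:


my @items = ( 'a', 'b', 'c' );

# Junk: the reference, the dereference, scalar(), and the parentheses
# add nothing to the meaning.
my $count = scalar( @{ \@items } );

# The same thing, without the noise.
my $simpler_count = @items;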

Perl Code Smells

Each language has its own code smells.

What follows are some Perl smells that I am aware of.

I developed this list while working on coding standards for cPanel. I sought input from cPanel developers on the smells themselves as well as the names we wanted to use.

C-style for Loop

The C-style for loop is a very flexible approach to looping.


for( my $i = 0; $i < @list; ++$i ) {
  ...
}

You will often see this overused by old C programmers.

C-style for Loop - 2

One of the problems of this approach is the amount of Code Junk devoted to just dealing with the index.

However, with minor changes, this loop can be used to skip elements, change the step size, or walk the list in a different order.

This makes the construct harder to maintain, since anyone reading the code pretty much has to look at the whole loop carefully to determine if there is any funny business going on with $i.

C-style for Loop - 3

Normally, Perl lists can be efficiently processed an element at a time with a list-based for loop.


foreach my $var (@list) {
  ...
}

When used in this form, each element in turn is aliased by the loop variable, allowing both read and write access without knowing an index into the list.

As such, the maintainer can see, at a glance, that the list is being processed in one direction only, and while it is possible that items might be skipped (using next or last), we are not doing anything weirder with the list order.

C-style for Loop - 4

When you really do need the index, the following construct is more idiomatic than the C-style loop.


foreach my $i ( 0..$#list ) {
      ...
}

This gives a variable $i that can be used to index into @list as needed.

The main difference between this construct and the C-style loop is that there is no possibility that the code could be changing the order in which you are processing the list, which is actually pretty easy with the C-style.

We can actually do quite a bit of the same kind of manipulation by modifying the index list to start with (grep for even indexes, reverse to do the list in backward order, etc.). Since all of those manipulations are in one place, the code is still easier to read.
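For example, a sketch of those manipulations using the same @list as the loops above:


# Process only the even indexes, in reverse order; the manipulation is
# visible in the index list itself.
foreach my $i ( reverse grep { $_ % 2 == 0 } 0 .. $#list ) {
    print "$i: $list[$i]\n";
}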

Since these two styles are more idiomatic, the maintenance programmer is more likely to spend the extra time trying to figure out why the original programmer decided to use the C-style loop. This is wasted time and effort every time a programmer needs to read this code. Additionally, the extra Code Junk in the C-style for loop almost requires reading the entire loop to be certain you know exactly what the loop is doing.

Implicit Return Value

In the absence of a return statement, Perl guarantees that the value of the last expression evaluated is returned.

The problem is that this is often not the expected result.

The most common surprise I have seen is in subroutines that end with an if or unless or a C-style for loop. In all of these cases, the last expression executed is the conditional. Another case is having a print or close as the last statement, making the return value be something that we normally ignore.

Worse, minor changes in the code can result in unexpected changes in return values. For example, putting a print statement at the end of the subroutine changes its return value.
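A minimal sketch (the names are hypothetical) of the surprise; an explicit return at the end would remove the ambiguity:


sub update_flag {
    my ( $self, $value ) = @_;
    $self->{flag} = $value if defined $value;

    # No explicit return: the caller gets the value of the last expression
    # evaluated. When $value is defined, that is the assignment ($value);
    # when it is not, it is the false value of the failed postfix 'if'.
}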

Schizo Return

Perl's subroutines have no type associated with their return values, so it is possible to have different parts of a subroutine return different kinds of values.


sub schizo_routine {
     my ($input) = @_;

     return if !defined $input;
     return 0 if !$input;
     #  do cool stuff
     ...
     return scalar @array;
}

From a quick look at the code, I'm not sure what the return is supposed to be. Is it supposed to be a number or a boolean? Is 0 supposed to be the same as undef?

Schizo Return - 2

More egregious examples would include subroutines that:

Other than a programmer trying to be too clever, the most likely causes of this code smell are either the Long Method/Subroutine Code Smell or a violation of the Single Responsibility Principle applied to subroutines.

Schizo Array Return

When people first learn about Perl's ability to change behavior based on context, they often come up with the following clever construction:


return wantarray ? @array : \@array;

How could that possibly go wrong?

At first glance, this looks like a good idea. If we want to use the result of this routine in a for loop or other list context, we get the list directly. If we assign to a scalar, we get a reference.

Schizo Array Return - 2

In reality, Perl provides list context most of the time. Two common cases:

  1. Pass the result as an argument to a subroutine.
  2. Use the result as a value when building a hash.


    my $status = do_work( schizo_routine() );

    my $config = {
          'name' => 'Herman Munster',
          'address' => '1313 Mockingbird Lane',
          'task' => schizo_routine(),
    };

In both of these cases, we are actually in list context and the result is not usually what the caller expected.

Schizo Array Return - 3

An even worse case is using the subroutine as the condition for an if or unless.


    if( schizo_routine() ) {
         do_work();
    }
    else {
         print "Error\n";
    }

Since this forces scalar context, you will always get a reference.

Since it's a reference, it will always be true. Given that we are in a conditional, that's probably not the expected behavior.

This smell is a special case of Schizo Return.

Since the type of output changes drastically depending on the context where it's called, this code smell causes problems if the developer and the language have different ideas of the actual context in which the code is called. This can result in confusion and wasted time troubleshooting code that should be unambiguous. Worse, the failure modes are not always obvious at a glance. It would probably be better to avoid the special logic and just return one type.

Context Oversensitivity

The wantarray function allows you to perform different behavior depending on the context where the subroutine is called. This can be useful for optimizations where you want to avoid a lot of work when the results will not be used.

Unfortunately, people normally try to use wantarray to do something clever.

That clever thing is almost always odd enough that programmers using the code will be regularly surprised as the code does something unexpected. That surprise is a direct cause of maintenance issues.

The real impact of this construct is in the confusion for maintenance programmers that are trying to use the code. Without a really good understanding of context and the decisions made inside the routine, this code is likely to be misused at least part of the time. Any construct that leads us to generate errors is definitely smelly.

There are very few ways to actually use wantarray in a way that is obvious to the maintenance programmer. One place that I have seen this done well is as an optimization. If I had a function that returned a sorted list, I could see using wantarray to avoid the sort, if we are not called in list context. This would avoid work that is just going to be thrown away in the case that we are calling the code (for instance) in a conditional.
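A sketch of that optimization, with hypothetical data:


my @cached_items = ( 'pear', 'apple', 'plum' );

sub sorted_items {
    # Skip the sort entirely when the caller will never see the order.
    return scalar @cached_items unless wantarray;
    return sort @cached_items;
}

my @in_order = sorted_items();               # list context: the sorted list
print "have items\n" if sorted_items();      # scalar context: just the count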

grep in Void Context

Since grep is designed to either build a list or count items in a list, calling it in void context is definitely a code smell.


 grep { $_ < 100 } @scores;

In fact, it is a strong sign that someone is doing something bad in the grep block. More importantly, given that grep's only real purpose is filtering the list and we are throwing its output away, the usage is almost certainly a bug.

At best, this would be an expensive no-op. At worst, we expected the output to go somewhere and the code no longer reflects that fact.

map in Void Context

Since map is designed to build a list, calling it in void context is definitely a code smell.


 map { 2*$_ } @values;

In fact, it is a strong sign that someone is doing something bad in the map block. If the block has side effects, we have another code smell (List-modifying map Expression). If the block has no side effects, then the map is an expensive no-op.

At best, this would be an expensive no-op. At worst, we expected the output to go somewhere and the code no longer reflects that fact.

List-modifying grep Expression

Since the $_ variable is aliased to each element in the list, it is actually possible to modify the original list as part of the grep operation.


    my @long_times = grep {
            /(\d+):(\d+)/;
            $_ = $1*60+$2;
            $_ > 120
        } @times;

This may not be obvious to a maintenance programmer and would violate the Principle of Least Astonishment. If you need to modify the original list and filter at the same time, a foreach loop would be clearer.
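A sketch of the clearer foreach version of the same job (it still modifies @times, but now it says so):


my @long_times;
foreach my $time (@times) {
    # Assumes every entry looks like "H:MM", as the grep version does.
    my ( $hours, $minutes ) = $time =~ /(\d+):(\d+)/;
    $time = $hours * 60 + $minutes;    # explicitly modifies @times
    push @long_times, $time if $time > 120;
}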


List-modifying map Expression

Since the $_ variable is aliased to each element in the list, it is actually possible to modify the original list as part of the map operation.


    my @minutes = map {
        /(\d+):(\d+)/;
        $_ = $1*60+$2;
    } @times;

This may not be obvious to a maintenance programmer and would violate the Principle of Least Astonishment. If you need to modify the original list and create a new list at the same time, a foreach loop would be clearer. If you just want to modify the original list, you can use map to generate a new list and assign it back to the original array.
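For instance, a sketch of the assign-back form (assuming every entry matches):


# Build the new values with map and assign them back to the original array.
@times = map { my ( $h, $m ) = /(\d+):(\d+)/; $h * 60 + $m } @times;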

Insatiable Match Variables

The match variables $`, $&, and $' collect the string before the match, the string of the match, and the string after the match.

The real problem is that using these once in a program causes a change to the regex engine that causes every regular expression in the system to set these variables.

This means that using any one of these variables causes a decrease in performance in every regular expression in the entire program.

These variables have far-reaching performance impacts. Using even one anywhere in a program causes all regular expressions to be slower.

I think this cost may have been removed in recent versions of Perl.

Bareword File/Directory Handle

Bareword filehandles are global to the current package. Lexical filehandles are scoped.


  open FH, ">", "output.log" or die;
  print FH "Help!\n";
  close FH;

Another advantage is that lexical handles are automatically closed when they go out of scope. Except in a very small number of cases, you should always be able to replace a bareword filehandle with a lexical filehandle.
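A sketch of the same code using a lexical handle:


open my $log_fh, '>', 'output.log' or die "Cannot open output.log: $!";
print {$log_fh} "Help!\n";
close $log_fh or die "Cannot close output.log: $!";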

Complex map Expression

Code using map can be quite clear and easy to read provided that the expression or block is not too complicated.

With a complicated expression, the map quickly becomes unreadable.


    my @output = map {
        my $t = /(\d+):(\d+)(?::(\d+))?/
            ? ($1*3600+$2*60+($3||0)) : 0;
        $t /= 3600;
        $t
    } @input;

If the expression begins needing multiple statements or multiple lines, this smell indicates that the expression may need to be refactored for ease of maintenance. If the code really needs to be that complex, converting the code to a named subroutine gives the advantages of being more readable in the map expression and more testable.
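One possible shape of that refactoring (the helper name is hypothetical):


sub _stamp_to_hours {
    my ($stamp) = @_;
    return 0 unless $stamp =~ /(\d+):(\d+)(?::(\d+))?/;
    return ( $1 * 3600 + $2 * 60 + ( $3 || 0 ) ) / 3600;
}

my @output = map { _stamp_to_hours($_) } @input;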

Complex grep Expression

Code using grep can be quite clear and easy to read provided that the expression or block is not too complicated.

Complicated grep expressions can also become almost unreadable in short order.


    my @output = grep {
        /(\d+):(\d+):(\d+)/
            ? (($1*3600+$2*60+($3||0))/3600 < 4.35)
            : length $_
    } @input;

If the expression begins needing multiple statements or multiple lines, this smell indicates that the expression may need to be refactored for ease of maintenance. If the code really needs to be that complex, converting the code to a named subroutine gives the advantages of being more readable in the grep expression and more testable.

Complex sort Expression

Perl's sort operator is quite flexible thanks to its use of a code block to declare the sorting criteria.

This results in a two-part problem.
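Whatever form the problem takes in a given codebase, one common cleanup is to give the comparison a name. A sketch with hypothetical data:


my @people = (
    { last => 'Munster', first => 'Herman', age => 150 },
    { last => 'Munster', first => 'Lily',   age => 140 },
);

# The comparison gets a name instead of living inline in the sort block.
sub by_name_then_age {
    return $a->{last} cmp $b->{last}
        || $a->{first} cmp $b->{first}
        || $a->{age} <=> $b->{age};
}

my @sorted = sort by_name_then_age @people;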

One-shot Lookup Hash

Everyone is aware that hash lookup is faster than searching an array (unless the array is small). Unfortunately, people sometimes use that knowledge to make a very bad pessimization.


sub clever_lookup {
    my ($to_find) = @_;
    my %lookup_hash = map { $_ => 1 } @a_long_list;
    return exists $lookup_hash{$to_find};
}

The problem with this code is that the hash construction and lookup are pure overhead. The purpose of using the hash is to avoid scanning the entire array to find your item. Since we walk the entire list anyway, we don't get any benefit from the lookup hash.

One-shot Lookup Hash - 2

Under almost every circumstance, the following would be faster.


sub simple_lookup {
    my ($to_find) = @_;
    return 0 != grep { $_ eq $to_find } @a_long_list;
}

If the hash can be used for multiple lookups, this becomes the Hash for Fast List Search idiom.
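A sketch of that idiom: build the hash once, then reuse it for every lookup:


our @a_long_list;    # the same list as in the earlier examples

my %is_in_list = map { $_ => 1 } @a_long_list;

sub repeated_lookup {
    my ($to_find) = @_;
    return exists $is_in_list{$to_find};
}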

Some people might argue that this could be a premature optimization, but in actuality it is the opposite. The main problem with a premature optimization is that it makes the code less clear for no benefit. In this case, we are simplifying code to make it faster and clearer.

If there is an actual need to make this faster, you can use List::Util::first or a foreach loop to exit early. That may yield a little more performance at the cost of code that is slightly less clear. So, actual performance should be measured before making this change.
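A sketch of the early-exit version using List::Util::first:


use List::Util qw(first);

sub early_exit_lookup {
    my ($to_find) = @_;

    # Stops scanning at the first match; assumes @a_long_list holds no
    # undef entries.
    return defined( first { $_ eq $to_find } @a_long_list );
}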

The One-shot Lookup Hash is actually quite a bit slower than the correct solution. A quick benchmark ranging from 5 to 78 items showed a factor of 10 to 15 improvement, with the larger improvement being in the longer lists. By the way, that was 10 times to 15 times faster, not 10%-15% faster. This also highlights the difference between a premature optimization and a premature pessimization. Not using the One-shot Lookup Hash generates a factor of 10 improvement with an improvement in the readability of the code (a premature pessimization not to make the change). Adding code to exit early instead of using grep is approximately a 50% speedup, at the cost of slightly less readable code (a premature optimization unless the speedup in this case is worth the reduction in readability).

Inappropriate sort Variables

The variables $a and $b from the current package are used directly by the sort operator.

In general, you should probably avoid the use of these variable names.

Using these package variables in other places in your code would cause them to change any time you use a sort. Additionally, if you declare $a or $b as lexical variables in a subroutine using a sort block, the block will be looking at the lexical variables, which mask the ones modified by sort. This will not end up doing what you want.
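A sketch of the lexical-masking case described above:


# A lexical $a now masks the package variable that sort sets, so the
# comparisons in the block are not comparing the list elements as intended.
my $a = 'unrelated value';
my @sorted = sort { $a <=> $b } ( 3, 1, 2 );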

Perl 4 Subroutine Call

In Perl 4, a subroutine call was always prefixed with the & sigil. It allowed the perl binary to recognize a subroutine call without any parameters or parens. This is no longer necessary.

In Perl 5, this does something subtly different.

If a subroutine is called with the & sigil without any arguments, it is called with the exact same arguments as the calling subroutine. Since this is not obvious in any way from the call (and violates the Principle of Least Astonishment), it is confusing to read and maintain.
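A sketch of the difference (the sub names are hypothetical):


sub log_args {
    printf "called with %d argument(s)\n", scalar @_;
}

sub wrapper {
    log_args( 'x', 'y' );    # normal call: log_args sees ('x', 'y')
    &log_args;               # & with no parens: log_args silently reuses
                             # wrapper's own @_
}

wrapper( 1, 2, 3 );          # the &log_args call reports 3 arguments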

System Infatuation

You will sometimes find shell scripts converted to Perl, through almost nothing but calls to system().


   system( "chmod +r logfile" );
   system( "grep -i error logfile > errors.txt" );
   system( "wc -l errors.txt" );

This was obviously a shell script, converted by someone who doesn't know Perl. This could be more easily done without a single system().
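A sketch of the same work in plain Perl (using the filenames from the example above; the chmod mode is an approximation of 'chmod +r'):


chmod 0644, 'logfile' or die "chmod failed: $!";

open my $in,  '<', 'logfile'    or die "Cannot read logfile: $!";
open my $out, '>', 'errors.txt' or die "Cannot write errors.txt: $!";

my $error_count = 0;
while ( my $line = <$in> ) {
    next unless $line =~ /error/i;
    print {$out} $line;
    $error_count++;
}
close $out;

print "$error_count errors.txt\n";    # what 'wc -l errors.txt' would report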

The worst version of this I've ever seen had Perl code that built text files and shell scripts to process the input files, and then passed control to the shell scripts.

Recognizing Code Smells

The ability to recognize code smells is a step on the way to writing better code.

Learning to correct code smells is yet another step.

Most senior programmers have certain constructs that they avoid due to bad experiences in the past. Code smells are just a way of naming some of these practices.

References


Questions?
