Perl Data Structures

Tips, Tricks, and Traps

G. Wade Johnson

Perl Data Structure Basics

Scalar - stores one thing
Array - an ordered group of zero or more scalars
Hash - a group of zero or more scalars, indexed by strings

There are 3 major types of data structures in Perl. There are a few other things that might count as data structures, but you may never run into them. We'll skip those others in this talk.

Scalar

Number: integer or floating point number
String
Reference
undef
Others you probably won't use

A scalar can store integers or IEEE floating point numbers. Scalars with numeric values act pretty much the way you would expect coming from any other language.

A scalar can also store a string. The strings are not particularly limited in size by anything except available memory. Unlike some languages (Java), Perl strings are mutable.

A Perl reference is basically a pointer or reference to a data structure or code. The undef value is a special value that means the scalar has not been set to anything at this time. Perl provides ways to unset a value, so undef is not quite the same as never used.
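A minimal sketch of the difference between a scalar that was never set and one that was explicitly unset. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $never_set;               # declared but never assigned: the value is undef
my $was_set = 'hello';

undef $was_set;              # explicitly unset; same effect as $was_set = undef

# defined() tells undef apart from other false values such as 0 or ''
my @states = map { defined $_ ? 'defined' : 'undef' }
             ($never_set, $was_set, 0, '');
```

Both the never-set and the unset scalar report as undef, while 0 and the empty string are defined (just false).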

Scalar: Operations

Arithmetic operations: +, -, *, /, %, ++, --
String operations: x, .
Numeric comparisons: ==, !=, <, >, <=, >=
String comparisons: eq, ne, lt, gt, le, ge
Logical Operations: !, &&, ||, not, and, or
Others...

Most of the scalar operations are pretty much what you would expect from any dynamic language.

Perl supports all of the normal arithmetic operations. They work on integers and floats, and combinations of the two.

The two string operations are repeat and concatenate. The repeat operator (x) makes a new string that consists of the left argument repeated the number of times supplied by the right argument. The . operator, on the other hand, makes a new string by concatenating the left argument with the right argument.

Scalar: Example


    my $num = 1;
    $num += 2;               # $num is now 3

    my $str = 'hip ' x 2;    # 'hip hip '
    $str .= 'hooray!';       # 'hip hip hooray!'

This little piece of example code doesn't really do much of anything except show how scalar variables are set and manipulated a bit. The my keyword specifies a lexical variable, which is what most languages call a local variable.

Array: Zero or More Scalars

They don't all have to contain the same type of data
Sparse array - some values are undef
Array is ordered
Array is accessed by a 0-based integer
Array knows its length
Grows or shrinks as necessary

Arrays contain lists of scalars. They are mutable and handle memory as they grow and/or shrink.

Array: Operations

Indexing with []
Add/remove from beginning of array: unshift/shift
Add/remove from the end of the array: push/pop
Iterate over with for/foreach

Retrieve items from an array with the $array[0] syntax. Perl arrays support easy operators for adding/removing from the beginning/left of an array or the end/right end of the array. Perl automatically handles expanding or contracting the memory used.

The for statement can be used to iterate over a list. The two names for and foreach are aliases. There is no functional difference between them.

Array: Example


   my @array = (1, 2, 3);
   push @array, 'a';
   my $one = shift @array;
   for my $e (@array)
   {
      say $e;
   }

Another simple example which shows some very simple code to set and manipulate arrays.

Hash

Indexed by a string
Can retrieve list of keys or list of values
Order of keys is not guaranteed

Called different things in different languages: associative array, hash, dictionary, hashmap, etc. You can store 0 or more scalars in a hash, indexed (or mapped) by strings. The key is actually a string, not a scalar.

There are hash operations for returning the list of keys in a hash and the list of values in a hash. The order of items in either list is not guaranteed and can change over the life of the hash.

Hash: Operations

Indexing with {}
Automatic conversion between list and hash
Test: exists
Remove: delete

The {} characters mark the fact that we are indexing into a hash.

If you assign a list to a hash, Perl treats the list items as alternating keys and values. If you use a hash in list context, Perl returns a list that alternates keys and values. The order of the keys is not defined, but each key is followed by its value.

The exists operation can tell if a particular key is in the hash. The delete operation is used to remove a key/value pair from a hash (returning the value).
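The operations above might look like this in practice. This is only a sketch; the hash name and its contents are invented for the example.

```perl
use strict;
use warnings;

# Assigning a list to a hash pairs the items up as key, value, key, value...
my %age = ('alice', 31, 'bob', 27);

# A hash in list context flattens back to alternating keys and values.
my @flat = %age;             # 4 items; the order of the pairs is not defined

my $has_bob   = exists $age{bob}   ? 1 : 0;
my $has_carol = exists $age{carol} ? 1 : 0;

# delete removes the pair and returns the value that was stored.
my $removed       = delete $age{bob};
my $still_has_bob = exists $age{bob} ? 1 : 0;
```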

Hash: Example


    my %hash = (
        david   => 'David Shenk',
        kirsten => 'Kirsten Austin',
        mark    => 'Moranderan',
        connie  => 'Connie Ronin',
    );
    say $_ foreach sort keys %hash;
    say "David's full name is '$hash{david}'";

Another example that basically shows a little of the syntax of working with hashes.

Scalars: Implementation

1-3 more pointers
reference count
some flags, etc.
optional lengths

Let's dig into the implementation of a scalar a little. A Perl scalar is actually a data structure of its own that contains several pieces. There are 1-3 pointers to underlying data. There is a reference count used to help with Perl's memory management. There are a handful of flags that tell Perl what the data in the structure means. Finally, there are a couple of optional lengths.

Most of the time, you don't need to know anything about this internal structure. However, once in a great while, these internals may explain behaviors that might be confusing.

Explore the internals a bit using Devel::Peek to dump scalars after performing various operations on them.
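One way to do that exploration, as a rough sketch: dump a scalar that starts as a string, use it as a number, and dump it again. The exact output of Dump differs between Perl versions, so no output is shown here.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

my $value = '42';        # starts life as a string
Dump($value);            # written to STDERR: look at FLAGS and the PV slot

my $sum = $value + 1;    # numeric use makes Perl work out a number for $value
Dump($value);            # compare the FLAGS and slots with the first dump
```

Comparing the two dumps is a good way to see which pieces of the structure were filled in by the numeric use.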

Scalar Tips

Can store an integer or a floating point value.
Can store a string.
Can store both a number and a string at the same time.
If Perl needs to auto-convert, it keeps both forms.

Most of these tips just follow up on the implications of the data on the previous slide.

Scalar Tips

Pre-increment (++$i) vs post-increment ($i++)
Work on integers and floats
Work on strings "aA0"
Pre-decrement (--$i) vs post-decrement ($i--)
Only works on integers and floats

The difference between pre-increment(decrement) and post-increment(decrement) is that the pre-form returns the new value and the post-form returns the old value of the scalar. In both cases, the operators make the same changes to the scalar.
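A short illustration of the returned values (variable names are arbitrary):

```perl
use strict;
use warnings;

my $i    = 5;
my $pre  = ++$i;    # $i becomes 6 and the new value (6) is returned

my $j    = 5;
my $post = $j++;    # $j becomes 6 but the old value (5) is returned
```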

The increment operators also work on strings. Explore the behavior in the REPL. Part of the reason for this is that the increment operation is used while managing the range operator (..).
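If you don't have a REPL handy, a small script shows the same thing. The sample strings are just illustrations; each one is copied before incrementing so the original list is left alone.

```perl
use strict;
use warnings;

my @before = ('aA0', 'Az', 'zz', 'a9');

# Increment a copy of each string and collect the results.
my @after = map { my $copy = $_; ++$copy } @before;
# @after is ('aA1', 'Ba', 'aaa', 'b0')

# The range operator uses the same string increment.
my @range = ('aa' .. 'ad');
# @range is ('aa', 'ab', 'ac', 'ad')
```

Each character position carries into the one to its left while keeping its kind (lower case, upper case, or digit).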

Scalar Trap

Post-increment (and post-decrement) creates a new scalar.
Not a big deal unless you are in a tight loop

Since post-increment (and post-decrement) returns the old value while updating the scalar, Perl must create a new scalar to hold the old value. If you are not using the old value, this is an unnecessary create/destroy of a temporary scalar. In most circumstances, this cost is so small that you don't really need to care about it. But, in a really tight loop, it could have an impact on running code.

Array Implementation

Pointer to the array of scalars
Flags
Lengths
Optional iterator

Once again, use Devel::Peek to explore the internals of the array. Show the different pieces, but there's not as much need to describe in this case.

Discuss the allocation strategy a bit, mostly to explain why we don't do an allocation on each change in the size of the array.

Array Tips

Negative indices count into the array from the right
Iterate over the elements of an array rather than indexes
$#array gives the right-most index in array
You can assign to $#array to change the size of the array

Negative indexes can be useful rather than working with the current length. Iterating over elements is safer than trying to walk indexes. No chance of accidentally walking off the end of the array (and extending it). Also anyone reading the code can tell the code will touch all elements, without having to read the code to figure out what is happening.

Resetting the final index can be used for truncating an array. It's more useful for pre-allocating a large array to avoid multiple re-allocations.
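A quick sketch of these tips, using made-up arrays:

```perl
use strict;
use warnings;

my @letters = qw(a b c d e);

my $last        = $letters[-1];   # 'e': -1 is the right-most element
my $second_last = $letters[-2];   # 'd'
my $last_index  = $#letters;      # 4, one less than the length

$#letters = 2;                    # truncate: @letters is now (a b c)

my @big;
$#big = 999;                      # pre-extend to 1000 elements, all undef
my $big_length = @big;            # an array in scalar context gives its length
```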

Array Traps

Assigning to an index off the end of an array extends it
Assigning to a negative index off the beginning of the array is an error
delete removes elements from the end of the array
delete replaces elements with undef if not at the end

You cannot accidentally extend the array off the front, since that's almost never what you want: assigning to a negative index that reaches past the first element is a fatal error, while reading one just returns undef. Assigning past the end of the array extends it, which is likely what you wanted. Once again, a case of Perl attempting to do what you intend, at the expense of consistency.
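The traps on this slide can be checked with a few lines. This is a sketch with invented array names; the eval is only there to catch the fatal error so the script keeps running.

```perl
use strict;
use warnings;

my @grow = (1, 2, 3);
$grow[5] = 'x';                        # assigning past the end extends the array
my $new_length = @grow;                # 6; elements 3 and 4 are undef

my $far_left = $grow[-10];             # reading too far left just gives undef
my $ok    = eval { $grow[-10] = 1; 1 };   # assigning there is a fatal error
my $error = $ok ? '' : $@;

my @middle = (1, 2, 3);
delete $middle[1];                     # not at the end: slot reads as undef, length stays 3

my @tail = (1, 2, 3);
delete $tail[2];                       # at the end: the array shrinks to 2 elements
```

Note that the perldoc entry for delete discourages using it on array elements at all; splice or pop is usually clearer.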

Array Trick: List Assignment


   ($a, $b, $c) = (1, 2, 3);
   ($a, undef, $c) = func();
   ($first, $second, @rest) = func();
   my $count = () = func();

If the left side of an assignment is a list of variables surrounded by parentheses, Perl assigns individual values from the right-hand list into the variables in the left-hand list. Entries on the left that are specified as undef are skipped from the assignment. If there are more items on the right than on the left, the extra values are not assigned. If an item in the left-hand list is an array, it takes all remaining items.

One special feature of list assignment is that, when it is performed in scalar context, it returns the number of items in the right-hand list. This gives the somewhat odd syntax in the last item, which executes func() in list context (because of the list assignment), does not save any of the returned values, but counts those values and assigns that count to $count.
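To make the slide runnable, here is a sketch with a stand-in func() that returns four values. The function body and the variable names are assumptions for the example only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the func() on the slide.
sub func { return (10, 20, 30, 40) }

my ($first, undef, $third) = func();   # 10 and 30; 20 is skipped, 40 is dropped
my ($head, @rest)          = func();   # 10, then (20, 30, 40)
my $count = () = func();               # 4: the number of values func() returned
```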

Array Complexity

Array access is a constant time operation
Searching an array is a linear time operation
Extending an array is an amortized constant time operation

A little information about the cost of using arrays.

The extension complexity is caused by the way extra memory is allocated to reduce the cost of each new element added. The trade-off is a larger amount of memory may be allocated than you will need.

Hash Tips

keys, values, and each return those items out of order
- return items in an order consistent with each other
Adding or deleting items will probably change the order

Wrapping your mind around the facts of the hash keys/values not being in a defined order.
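A small check of the "consistent with each other" point, assuming the hash is not modified between the two calls. The hash contents are just sample data.

```perl
use strict;
use warnings;

my %sounds = (dog => 'bark', cat => 'meow', snake => 'hiss');

my @keys   = keys %sounds;     # some order, not necessarily the insertion order
my @values = values %sounds;   # same order as @keys

# Count positions where the value does not belong to the key at that position.
my $mismatches = grep { $sounds{ $keys[$_] } ne $values[$_] } 0 .. $#keys;

# If you need a predictable order, sort the keys yourself.
my @sorted = sort keys %sounds;
```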

Hash Tips

Missing items return a value of undef
If your hash can contain undefs, use exists

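Talk about missing items. There are times when just checking for a value to be defined is enough, but sometimes it's not: a key that is present with an undef value looks the same as a missing key unless you use exists.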
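A sketch of the difference, with an invented hash where one key is deliberately stored with undef:

```perl
use strict;
use warnings;

my %middle_name = (ann => 'Marie', bob => undef);

my $bob_defined  = defined $middle_name{bob}  ? 1 : 0;   # 0: the value is undef
my $bob_exists   = exists  $middle_name{bob}  ? 1 : 0;   # 1: the key is there
my $carl_defined = defined $middle_name{carl} ? 1 : 0;   # 0: missing key gives undef
my $carl_exists  = exists  $middle_name{carl} ? 1 : 0;   # 0: the key is not there
```

Only exists can tell bob (known, no middle name) apart from carl (never stored).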

Hash Tricks: Sets


   my %set = map { $_ => 1 }
                 qw(Fortran C C++ Perl Forth Java
                    JavaScript Ruby Lisp Rust);
   say 'yes' if $set{'C++'};
   say 'no' unless $set{'C#'};

An example showing the use of a hash as a set of strings. The actual values don't mean anything. We just care about whether or not there is a value associated with the key. Assigning a key multiple times still ends up with only one item, obviously, making this act as a set.

Hash Tricks: Count Words


  my %words;
  my $line;
  while($line = <<>>)
  {
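    # grep skips the empty leading field split returns when a line starts with a non-letter
    ++$words{$_} foreach grep { length } split /[^A-Za-z]+/, $line;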
  }

This example reads text from input (using the diamond operator) and counts the words in it. We use the increment operator on the hash values, which Perl handles correctly even if the value was unset: undef is treated as 0, so the first increment yields 1. We key the hash with the words that we get from the split. The result is pretty concise, and not too hard to understand once you understand hashes and the basics.

Hash Trick: Transpose


   my %sounds = qw(dog bark cat meow bunny thump snake hiss);
   my %animal_makes = reverse %sounds;

One unusual side effect of the conversion between hashes and lists is the way reverse can be used to swap the keys and values in a hash. This only works well if none of the values are duplicates, of course.

Treating the hash as a list returns a list alternating keys and values. Reversing that list converts it to a list alternating values and keys. Assigning to a new hash uses the old values as keys and the old keys as the corresponding values.

Hash Traps

Hash keys are always strings
Anything else is converted into a string to be a key

Keys are strings, not scalars. There are a few different side-effects of that, but mostly it's just something to be aware of. Some side effects:

undef becomes an empty string
The number 1 and the string '1' map to the same thing
References are converted to a string form that is not particularly useful and cannot be used to refer to the original object
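These side effects are easy to demonstrate. The following is a sketch with made-up names; the warning for the undef key is silenced locally so the script runs cleanly.

```perl
use strict;
use warnings;

my %h;

$h{1}   = 'number';
$h{'1'} = 'string';            # same key: this overwrites the previous value
my $count_after_ones = keys %h;    # 1

my $ref = [1, 2, 3];
$h{$ref} = 'array';            # the key is a string that looks like ARRAY(0x...)
my ($ref_key) = grep { /^ARRAY\(0x/ } keys %h;
my $key_is_plain_string = ref($ref_key) eq '' ? 1 : 0;   # 1: no longer a reference

my $missing;                   # undef
{
    no warnings 'uninitialized';
    $h{$missing} = 'nothing';  # stored under the empty string ''
}
my $has_empty_key = exists $h{''} ? 1 : 0;
```

Note that writing $h{undef} would not show the undef case: inside the braces the bare word is quoted, giving the five-character key 'undef'.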

Hash Complexity

Hash lookup is a (mostly) constant time operation
Hash element access is a (mostly) constant time operation
Hash extension is an amortized constant time operation

A little information about the cost of using hashes.

The first two are related.

The extension complexity is caused by the way extra memory is allocated to reduce the cost of each new element added. The trade-off is a larger amount of memory may be allocated than you will need.

Data Structure Complexity Trap

Array and hash access are both constant time operations
Array access is faster than hash access
Each hash access
- Perform hash function on key to test.
- Index into hash internals
- If no match, done
- If match, compare key string to (one or more) matching keys
Hash access can be slower than searching an array, for small arrays
Most of the time, the speed difference doesn't matter

A lot of people have the impression that hashes are faster than arrays. That is not completely true. The reality is a bit more complicated. Part of what makes the hash data structure work is a hashing function that converts a string into an index into the underlying data structure. This hashing function takes a small amount of time.

For really short arrays, searching the array for a string is faster than hashing a key and indexing. Under many circumstances the difference in time is negligible, but in certain cases it can make a difference. Unless you know the performance difference actually matters in your case, use the data structure that is the most readable.

If you are doing these lookups in a tight loop, or millions of times, it might be worth measuring the time spent and deciding if it is worth the optimization.

Data Structure Complexity Trap

Building a hash is a linear time operation
Premature pessimization
- Build a hash from a list
- Look up value
- Throw hash away
Running grep over the list would be faster
Saving the hash is worthwhile for doing multiple lookups on a big list

One place where I have regularly seen people use a hash when it definitely has a performance impact is building a hash from a list, looking up one value, and then discarding the hash. This is never a reasonable thing to do. It doesn't matter how big the list is, or how fast the hashing function is; just searching the list is always faster than this.

The problem is that building the hash requires walking the whole list and performing an operation that is likely more expensive than the comparison on every string, just to create the hash. Then, you do the hash lookup (which costs time). If you immediately throw away the hash, you have gained no benefit. Some people will tout the constant-time lookup of the hash and completely ignore the linear-time construction of the hash.

If you end up using the same hash for multiple lookups, it will eventually become faster to have used the hash. As usual, measure to figure out which trade-off works best for your code.
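One way to measure this is with the core Benchmark module. The sketch below compares the two approaches for a single lookup; the list contents, the target value, and the sub names are all invented for the example, and the timings will depend on your machine and Perl version, so no results are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @list   = map { "item$_" } 1 .. 1000;
my $wanted = 'item750';

# Build a hash, do one lookup, throw the hash away.
sub via_hash {
    my %seen = map { $_ => 1 } @list;
    return exists $seen{$wanted} ? 1 : 0;
}

# Just search the list.
sub via_grep {
    return ( grep { $_ eq $wanted } @list ) ? 1 : 0;
}

# Run each for at least one CPU second and print a comparison table.
cmpthese(-1, { hash_once => \&via_hash, grep_once => \&via_grep });
```

To see where the hash starts to pay off, build it once outside the sub and repeat the lookup many times inside each timed sub.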

List Tools: map

Apply functionality to a list


  my @days = qw(Monday Tuesday Wednesday Thursday Friday Saturday Sunday);
  my @short_days = map { substr($_, 0, 3) } @days;

The map function transforms a list by applying a block of code to each element, returning a new list of the resulting values. Although the code is applied to each element, it is actually more useful to think of this as transforming the whole list.

List Tools: grep

Filter a list


   my @lines = <<>>;
   my @long  = grep { length $_ > 100 } @lines;

The grep filters a list to generate a new list by executing the supplied code on each element. Any element that causes the code to return a true value is included in the returned list.

List Tools: sort

Order a list


   my @words = <<>>;
   my @len_sorted = sort { length $a <=> length $b || $a cmp $b } @words;

The sort function sorts the items in a list, returning the sorted list. With no supplied comparison code, the function sorts ASCIIbetically. If a code block is supplied the two elements to compare are aliased to the variables $a and $b. The code block should return a negative number if the first element should sort before the second, a positive number if the second should sort before the first, or 0/false if the two items are equivalent.
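The default string ordering is a common surprise with numbers. A minimal sketch with sample data:

```perl
use strict;
use warnings;

my @nums = (10, 9, 100, 1);

my @as_strings = sort @nums;                 # (1, 10, 100, 9): compared as strings
my @ascending  = sort { $a <=> $b } @nums;   # (1, 9, 10, 100): compared as numbers
my @descending = sort { $b <=> $a } @nums;   # swap $a and $b to reverse the order
```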

Array Trick: splice

The power tool for arrays
Remove, add, replace items in an array


   my @array = qw(a b c d e f g h i);
   splice @array, 2, 3, qw(C D E E1 E2);

The splice function allows you to manipulate an array in several ways. It can remove items from an array, insert items anywhere in an array, or combine those functions to replace items in the array with different items from a list. The array functions push, pop, shift, unshift and delete can all be simply implemented in terms of splice. But, splice is more flexible.
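As a sketch, here is the slide's call with its result, followed by splice calls that behave like push, unshift, pop, and shift. The second array and the scalar names are made up for the example.

```perl
use strict;
use warnings;

# The example from the slide: replace 3 items starting at index 2 with 5 new ones.
my @array   = qw(a b c d e f g h i);
my @removed = splice @array, 2, 3, qw(C D E E1 E2);
# @removed is (c d e); @array is (a b C D E E1 E2 f g h i)

my @items = qw(b c d);
splice @items, scalar(@items), 0, 'e';   # like push @items, 'e'
splice @items, 0, 0, 'a';                # like unshift @items, 'a'
my $popped  = splice @items, -1;         # like pop @items
my $shifted = splice @items, 0, 1;       # like shift @items
```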

Modules

Scalar::Util
List::Util, List::MoreUtils
Hash::Util

If you want more tools for manipulating Perl data structures, check out these modules. They provide further functionality for each of the data structure types.
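A small taste of two of these modules, using functions that have been in Scalar::Util and List::Util for a long time. The sample data is invented.

```perl
use strict;
use warnings;
use List::Util   qw(sum max min first);
use Scalar::Util qw(looks_like_number);

my @nums = (3, 8, 1, 12, 5);

my $total     = sum @nums;                 # 29
my $largest   = max @nums;                 # 12
my $smallest  = min @nums;                 # 1
my $first_big = first { $_ > 4 } @nums;    # 8: stops at the first match

my $numeric     = looks_like_number('1e3')   ? 1 : 0;   # 1
my $not_numeric = looks_like_number('abc')   ? 1 : 0;   # 0
```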

References

perldoc perldata
perldoc perlop

Much of the information I covered in this talk (and quite a bit more) is available on any system that has Perl installed. The first command describes the Perl data structures. The second describes many Perl operators, including the ones I mentioned here.

Schwartzian Transform


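   my @sorted_files =
               map  { $_->[0] }               # 3. unwrap: keep only the file name
               sort { $a->[1] <=> $b->[1] }   # 2. sort the pairs by the cached size
               map  { [ $_, -s $_ ] }         # 1. wrap: pair each name with its size
               @files;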

A very useful sort optimization called the Schwartzian Transform chains the sort and map functions in an interesting way. The basic idea is that we want to sort a list based on some function that is expensive to calculate.

Using the obvious approach, we will perform the expensive calculation twice for each comparison (left and right side of the comparison). For a list of length n, sort will do about n * log2(n) comparisons (where log2 is the log base 2). In other words, for 1024 items in the list, we will do 1024 * 10 or 10,240 comparisons. Which means we run the expensive function 20,480 times.

Sorting 1024 items compared as strings happens incredibly fast, but if the expensive function took 0.1 seconds to run, the sort would now take 34 minutes to run.

Using the Schwartzian Transform, we convert each element in the list into a pair of the element and the calculated value. This means we only run the expensive function one time for each item of the list. For a list of 1024 items, that would take 1 minute 42 seconds. Now, the sort does a simple comparison all of the times it needs to, and the top map converts the pairs back into the original elements.

This pattern is a little complex, but you should be able to walk through it from bottom to top and understand what it does. In the example on the slide, the expensive function is the -s file size test, which accesses the disk for each file. That is not only (relatively) slow, but also variable in time because of the hardware. Running it inside the comparison is a great example of a slow comparison function that I have seen used in real life.
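To see the saving without touching the disk, here is a sketch that counts how often a stand-in "expensive" function is called by each approach. The function, the word list, and the tie-break on the word itself are assumptions added for the example.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical stand-in for an expensive calculation; it just counts its calls.
sub expensive {
    my ($word) = @_;
    ++$calls;
    return length $word;
}

my @words = qw(pear fig banana kiwi apple watermelon plum date);

# Obvious approach: two expensive calls for every comparison.
$calls = 0;
my @naive = sort { expensive($a) <=> expensive($b) or $a cmp $b } @words;
my $naive_calls = $calls;

# Schwartzian Transform: one expensive call per element.
$calls = 0;
my @transformed =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
    map  { [ $_, expensive($_) ] }
    @words;
my $transform_calls = $calls;
```

Both approaches produce the same order, but the transform calls the function exactly once per word while the obvious approach calls it twice per comparison.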

Scalar Trick: Dual Vars

A scalar can contain both a string and a number
Number returned in numeric context
String in string context
Very rarely useful

Although not used too often, this is an interesting trick. Using the dualvar function in the Scalar::Util module, you can create scalars with both a string and numeric component. An example would be a scalar that contained both the numeric code and printable message for an HTTP response:

    use Scalar::Util 'dualvar';
    my $status = dualvar( 404, "Not Found" );

If you use $status as a number (say by adding 0 to it or comparing it to 404), it has the numeric value. If you were to use it as a string (say by printing or interpolating it into a string), it has the string value.
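A slightly fuller sketch of the same idea, showing both contexts side by side (the extra variable names are only for illustration):

```perl
use strict;
use warnings;
use Scalar::Util qw(dualvar);

my $status = dualvar(404, 'Not Found');

my $as_number = $status + 0;               # numeric context: 404
my $as_string = "$status";                 # string context: 'Not Found'
my $is_404    = $status == 404 ? 1 : 0;    # numeric comparison uses the number
```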

Houston.pm

March 12, 2020
