The "Empty Member" C++ Optimization

The following article has been accepted for publication in the August '97 "C++ Issue" of Dr. Dobb's Journal.

The "Empty Member" C++ Optimization

The (Draft) Standard C++ library is filled with useful templates, including extended versions of those found in the award-winning STL (See DDJ Mar95). These templates offer great flexibility, yet they are optimized for best performance in the common case. As useful as they are in programs, they are equally useful as examples of effective design, and as a source of inspiration for ways to make your own components as efficient and flexible.

Some of the ways that they offer flexibility involve "empty" classes: classes which have no data members. Ideally, these empty classes should consume no space at all. They typically declare typedefs or member functions, and you can replace them with your own classes (which might not be empty) to handle special needs. The default empty class is by far the most commonly used, however, so this case must be optimal so that we do not all pay for the occasional special needs of a few.

Due to an unfortunate detail of the language definition (explained fully later), instances of empty classes usually occupy storage. In members of other classes, this space overhead can make otherwise- small objects large enough to be unusable in some applications. If this overhead couldn't be avoided in the construction of the Standard Library, the cost of the library's flexibility would drive away many of its intended users. Optimization techniques used in the Standard library are equally useful in your own code.

Empty Member Bloat

Here's an example. In the Standard C++ library each STL Container constructor takes, and copies, an allocator argument. Whenever the container needs storage, it asks its allocator member. In this way a user with some special memory to allocate (such as a shared memory region) can define an allocator for it, and pass that to a container constructor, so that container elements will be placed there. In the common case, however, the Standard default allocator is used, which just delegates to the global operator new. It is an empty object.

Here's simplified code for a possible implementation of these components. In Listing 1, the Standard default allocator, allocator<>, only has function members.

Listing 1: The Standard Default Allocator


  template <class T>
    class allocator {   // an empty class
      . . .
      static T* allocate(size_t n)
        { return (T*) ::operator new(n * sizeof T); }
      . . .
    };

In Listing 2, the generic list container template keeps a private allocator member, copied from its constructor argument.

Listing 2: The Standard Container list<>

  template <class T, class Alloc = allocator<T> >
    class list {
      . . .
      Alloc alloc_; 
      struct Node { . . . };
      Node* head_;      

     public:
      explicit list(Alloc const& a = Alloc())
        : alloc_(a) { . . . }
      . . .
    };

The list<> constructor is declared "explicit" so that the compiler will not use it as an automatic conversion. (This is a recent language feature.)
How a list<> gets storage for Node objects, from an allocator intended to supply storage for T objects, is a subject for an upcoming article.

The member list<>::alloc_ occupies, usually, four bytes in the object, even in the default case where Alloc is the empty class allocator<T>. A few extra bytes for the list object doesn't seem so bad until you imagine a big vector of these lists, as in a hash table. Any extra junk in each list is multiplied, and reduces the number of list headers that fit in a virtual memory page. Wasted space makes slower programs.

Empty Objects

How can this overhead be avoided? To see, you need to understand why the overhead is there. The Standard C++ language definition says:

A class with an empty sequence of members and base class objects is an empty class. Complete objects and member subobjects of an empty class type shall have nonzero size.

Why must objects with no member data occupy storage? Consider:


  struct Bar { };
  struct Foo {
    struct Bar a[2];
    struct Bar b;
  };
  Foo f;

What are the addresses of f.b and the elements of f.a[]? If sizeof(Bar) were zero, they might all have the same address. If you were keeping track of separate objects by their addresses, f.b and f.a[0] would appear to be the same object. The Standard committee chose to finesse these issues by forbidding zero-sized addressable objects.

Still, why does an empty member take up so much space (four bytes, in our example)? On all the compilers I know of, sizeof(Bar) is 1. However, on most architectures objects must be aligned according to their type. For example, if you declare


  struct Baz {
    Bar b;
    int* p;
  };

on most architectures today, sizeof(Baz) is 8. This is because the compiler adds "padding" so that member Baz::p doesn't cross a memory word boundary. (See Figure 1a.)

Figure 1a:


  struct Baz
  +-----------------------------------+
  | +-------+-------+-------+-------+ |
  | | Bar b | XXXXX | XXXXX | XXXXX | |
  | +-------+-------+-------+-------+ |
  | +-------------------------------+ |
  | | int* p                        | |
  | +-------------------------------+ |
  +-----------------------------------+

Figure 1b:


  struct Baz2
  +-----------------------------------+
  | +-------------------------------+ |
  | | int* p                        | |
  | +-------------------------------+ |
  +-----------------------------------+

How can you avoid this overhead? The Draft hints, in a footnote:

A base class subobject of an empty class type may have zero size.

In other words, if you declared Baz2 this way,


  struct Baz2 : Bar {
    int* p;
  };

then a compiler is allowed to reserve zero bytes for the empty base class Bar; hence, sizeof(Baz2) can be just 4 on most architectures. (See Figure 1b.)

Compiler implementers are not _required_ to do this optimization, and many don't, yet. However, you can expect that most Standard-conforming compilers will, because the efficiency of so many components of the Standard library (not only the containers) depends on it.

Eliminating Bloat

You have found the principle you need to eliminate the space overhead. How can you apply it? Let's consider how you might fix the implementation of the the example, template list<>. You could just derive from the allocator, as in Listing 3,

Listing 3: A Naïve Way to Eliminate Bloat


  template <class T, class Alloc = allocator<T> >
    class list : private Alloc {
      . . .
      struct Node { . . . };
      Node* head_;      

     public:
      explicit list(Alloc const& a = Alloc())
        : Alloc(a) { . . . }
      . . .
    };

and it would work, mostly. Code in the list<> member functions would get storage by calling "this->allocate()" instead of "alloc_.allocate()". However, the Alloc type supplied by the user is allowed to have virtual members, and these could conflict with a derived list<> member. (To see this, imagine a private member void list<>::init(), and a virtual member bool Alloc::init().)

A much better approach is to package the allocator with a list<> data member, such as the pointer to the first list node (as in Listing 4), so that the allocator's interface cannot leak out.

Listing 4: A Better Way to Eliminate Bloat


  template <class T, class Alloc = allocator<T> >
    class list {
      . . .
      struct Node { . . . };
      struct P : public Alloc {
        P(Alloc const& a) : Alloc(a), p(0) { }
        Node* p;
      };
      P head_;
      
     public:
      explicit list(Alloc const& a = Alloc())
        : head_(a) { . . . }
      . . .
    };

Now, list<> members get storage by saying "head_.allocate()", and mention the first list element by saying "head_.p". This works perfectly, there's no unnecessary overhead of any kind, and users of list<> can't tell the difference. Like any good optimization, it makes the implementation a bit messier, but doesn't affect the interface.

A Packaged Solution

There is still room for improvement. As usual, the improvement involves a template. In Listing 5 we've packaged the technique so that it is easy and clean to use.

Listing 5: Packaging the Technique


  template <class Base, class Member>
    struct BaseOpt : Base {
      Member m;
      BaseOpt(Base const& b, Member const& mem) 
        : Base(b), m(mem) { }
    };

Using this template, our list<> declaration appears as in Listing 6.

Listing 6: The Best Way to Eliminate Bloat


  template <class T, class Alloc = allocator<T> >
    class list {
      . . .
      struct Node { . . . };
      BaseOpt<Alloc,Node*> head_;
      
     public:
      explicit list(Alloc const& a = Alloc())
        : head_(a,0) { . . . }
      . . .
    };

This declaration of list<> is no bigger or messier than the unoptimized version we started with. Any other library component (Standard or otherwise) can use BaseOpt<> just as easily. The member code is only slightly messier; while it's not immediately obvious what's going on, the declaration of BaseOpt<> provides a natural place to document the technique and the reasons for it.

It is tempting to add members to BaseOpt<>, but that would not improve it: they could conflict with members inherited from the Base parameter, just as in Listing 3.

Finally

This technique can be used now, under any compiler with sturdy template support. Not all C++ compilers support the empty-base optimization yet, though the Sun, HP, IBM, and Microsoft compilers do, but the technique costs nothing extra even on those that don't. When you get a Standard-conforming compiler, it probably will do the optimization, and if your code uses this technique, it will become more efficient automatically.

Fergus Henderson contributed an essential refinement to this technique.

Update: the latest Borland compiler is able to do the optimization, controlled by a mode switch; it defaults off in the current version. Watcom's and Symantec's compilers do the optimization. The Metaware compiler does it when running under OS/2. (Other compilers that do the optimization will be listed here, as I get notified.)

A Watcom engineer reports that STL benchmarks ran 30% faster after they implemented the empty-base optimization.

Nathan has worked on the ISO/ANSI C++ Standard since 1993. He designed most of what is in Chapter 22 of the Draft Standard. His interests include library interface design, low level code efficiency, and Free software.

Return to the Cantrip Corpus. Email remarks about this page to ncm@cantrip.org.