C++11's Move Semantics Are Not Free

Overview

I was taking a break on StackOverflow and discovered this question about expression templates (ETs):

I was wondering whether lambdas together with move semantics or any other new feature can do as good as ETs. Any thoughts?

So I answered the question: no, expression templates are still needed! This blog post is a rewritten version of my response on StackOverflow.

C++11 Move Operations Are Not Free

If you are new to C++11 think of move semantics as optimized copy operations. Never think of them as operations with zero overhead because they don't have zero overhead. In the worst case, moving something has the same complexity as copying: i.e., O(moving data) = O(copying data). In the best case (for classes/structs with data members), the cost of moving is at least the cost of copying pointers to internal data structures: i.e., Ω(moving data) = Ω(copying pointers to internal data structures). In no case, except when the class/struct has no data members, will the cost of moving something be zero: i.e., Ω(moving data) ≠ 0 just as Ω(copying data) ≠ 0 under the same circumstances.

If it ever appears that the cost of moving or copying data in C++ is zero, it is because of the copy elision optimization and the return value optimization (RVO) are being performed by the C++ compiler –or in a very niche set of cases your classes/structs have no member variables.

The example presented below and in the StackOverflow answer I made shows why the costs of moves needs to be considered. In the example, a math_vector class using an statically-sized std::array as its internal representation. Thus, the cost of moving a math_vector object is the same as copying it and so avoiding any extra copies or moves involving temporaries is of utmost importance. However, I don't stop there as the next question one would likely ask is: how does one get rid of the extra copies and or moves? The method used to accomplish this was the C++ template metaprogramming technique called expression templates.

The Objective and the Problem

When code is written, the goal usually sought after is to have maximum efficiency while keeping the code understandable and maintainable by humans. Object-oriented and object-based programming can help achieve this goal as it groups the operations that can be performed on types together. This leads to code that looks like:

{
  cout << "CASE 1:\n";
  math_vector<3> a{1.0, 1.1, 1.2};
  math_vector<3> b{2.0, 2.1, 2.2};
  math_vector<3> c{3.0, 3.1, 3.2};
  math_vector<3> d{4.0, 4.1, 4.2};
  math_vector<3> result = a + b + c + d;
  cout << '[' << &result << "]: " << result << "\n";
}
cout << endl;
{
  cout << "CASE 2:\n";
  math_vector<3> result =
    math_vector<3>{1.0, 1.1, 1.2} +
    math_vector<3>{2.0, 2.1, 2.2} +
    math_vector<3>{3.0, 3.1, 3.2} +
    math_vector<3>{4.0, 4.1, 4.2}
  ;
  cout << '[' << &result << "]: " << result << "\n";
}

where: math_vector<3> is represents a (mathematical) vector object that has 3 dimensions representing internally by a statically-sized std::array<long double, 3>

When compiled with C++11 move-aware code but without using any expression template coding techniques, the above code outputs:

CASE 1:
0x7ffff9902980: math_vector(initlist)
0x7ffff99029b0: math_vector(initlist)
0x7ffff99029e0: math_vector(initlist)
0x7ffff9902a10: math_vector(initlist)
0x7ffff9902ae0: math_vector(copy: 0x7ffff9902980)
0x7ffff9902ab0: math_vector(move: 0x7ffff9902ae0)
0x7ffff9902a40: math_vector(move: 0x7ffff9902ab0)
0x7ffff9902ab0: ~math_vector()
0x7ffff9902ae0: ~math_vector()
[0x7ffff9902a40]: (10,10.4,10.8)
0x7ffff9902a40: ~math_vector()
0x7ffff9902a10: ~math_vector()
0x7ffff99029e0: ~math_vector()
0x7ffff99029b0: ~math_vector()
0x7ffff9902980: ~math_vector()

CASE 2:
0x7ffff9902b10: math_vector(initlist)
0x7ffff9902b80: math_vector(initlist)
0x7ffff9902bf0: math_vector(initlist)
0x7ffff9902c30: math_vector(initlist)
0x7ffff9902bc0: math_vector(move: 0x7ffff9902c30)
0x7ffff9902b50: math_vector(move: 0x7ffff9902bc0)
0x7ffff9902a40: math_vector(move: 0x7ffff9902b50)
0x7ffff9902b50: ~math_vector()
0x7ffff9902bc0: ~math_vector()
0x7ffff9902c30: ~math_vector()
0x7ffff9902bf0: ~math_vector()
0x7ffff9902b80: ~math_vector()
0x7ffff9902b10: ~math_vector()
[0x7ffff9902a40]: (10,10.4,10.8)
0x7ffff9902a40: ~math_vector()

i.e., CASE 1 requires four explicit constructed instances using initializer lists (i.e., the "initlist" items), the "result" variable (i.e., `0x7fff8d6eddd0`), and an additional three objects for copying and moving; CASE 2 is better only requiring three extra objects for moving.

But one can do better: zero extra temporaries should be possible with the one caveat: all explicitly instantiated types would still be created (i.e., the four "initlist" constructors and "result"). This will be achieved using the expressions templates technique by creating:

  • a proxy math_vector_expr<leftexpr ,BinaryOp,RightExpr> class to hold an expression not computed yet,
  • a proxy plus_op class was created to hold the addition operation not computed yet,
  • a constructor added to math_vector to accept a math_vector_expr object,
  • "starter" member functions in math_vector to trigger the creation of the expression template (e.g., operator + and operator +=), and,
  • "end" member functions in math_vector to force the computation to occur (e.g., by assigning a math_vector_expr to a math_vector and by passing a math_vector_expr to a math_vector constructor).

Key is realizing that care was taken to ensure that all function arguments are perfectly forwarded and all return values are moved with the proxy objects and all code invoked from the use of such. Additionally, code was written to ensure that copies would not be needlessly be made by writing overloads for rvalue references –not just constant references– and moving returned values when they are lvalues referring to rvalue references.

If you want or need to more about move semantics, lvalues, and rvalues see my series that starting with Applying std::tuple to Functors Efficiently.

The results using expression templates are wonderful: no extra temporaries are created in CASE 1 or CASE 2. This is demonstrated by the following output:

CASE 1:
0x7fff4d2ed820: math_vector(initlist)
0x7fff4d2ed850: math_vector(initlist)
0x7fff4d2ed880: math_vector(initlist)
0x7fff4d2ed8b0: math_vector(initlist)
0x7fff4d2ed8e0: math_vector(expr: 0x7fff4d2ed950)
[0x7fff4d2ed8e0]: (10,10.4,10.8)
0x7fff4d2ed8e0: ~math_vector()
0x7fff4d2ed8b0: ~math_vector()
0x7fff4d2ed880: ~math_vector()
0x7fff4d2ed850: ~math_vector()
0x7fff4d2ed820: ~math_vector()

CASE 2:
0x7fff4d2ed990: math_vector(initlist)
0x7fff4d2ed9e0: math_vector(initlist)
0x7fff4d2eda30: math_vector(initlist)
0x7fff4d2eda70: math_vector(initlist)
0x7fff4d2ed8e0: math_vector(expr: 0x7fff4d2ed980)
0x7fff4d2eda70: ~math_vector()
0x7fff4d2eda30: ~math_vector()
0x7fff4d2ed9e0: ~math_vector()
0x7fff4d2ed990: ~math_vector()
[0x7fff4d2ed8e0]: (10,10.4,10.8)
0x7fff4d2ed8e0: ~math_vector()

i.e., only five math_vector's are created in each case: the four "initlist" objects and the "result" variable. If one looks at the resulting assembler code the results are even better:

fldt  128(%rsp)
leaq  128(%rsp), %rdi
leaq  80(%rsp), %rbp
fldt  176(%rsp)
faddp %st, %st(1)
fldt  224(%rsp)
faddp %st, %st(1)
fldt  272(%rsp)
faddp %st, %st(1)
fstpt 80(%rsp)
fldt  144(%rsp)
fldt  192(%rsp)
faddp %st, %st(1)
fldt  240(%rsp)
faddp %st, %st(1)
fldt  288(%rsp)
faddp %st, %st(1)
fstpt 96(%rsp)
fldt  160(%rsp)
fldt  208(%rsp)
faddp %st, %st(1)
fldt  256(%rsp)
faddp %st, %st(1)
fldt  304(%rsp)
faddp %st, %st(1)
fstpt 112(%rsp)

which appears between the initialization of the four "initlist" variables and the output of the "result" data. The reason this is an "even better" result is that absolutely no function calls are done here: only CPU operations related to loading and adding floating point numbers. This is what one would do when writing hand-optimizing the code, e.g., unrolling loops, etc.

Expression Templates

Before showing the code below, a short bit concerning expression templates should be provided. Expression templates are a form of template metaprogramming. Template metaprogramming is a style of programming using C++ templates that has the compiler compute the code to generate at compile-time. This can be done because the C++ template mechanism is itself Turing-complete. Unlike "normal" programming it involves the use of types and/or constants.

Aside: Overall template metaprogramming is not easy, so if you are new to C++ wait until you are at least an intermediate-skilled C++ programmer before jumping into the C++ metaprogramming world as you will need to understand key subtle points of the language. Many of my other C++ posts on this site use template metaprogramming if you want to see some of the things that can be done and some of what it entails.

The expression template technique involves building parse trees on recognized expressions using user-defined types. Each type in these parse trees is a proxy object representing the values of what is to be computed at a later point in time. Thus, each proxy object stores its arguments and any operations to be performed without actually performing the operation until such absolutely needs to be done or otherwise desired.

In the code below, this is exactly what the math_vector_expr class does. Notice that it stores the "left expression" and the "right expression" for some "binary operator". Notice that both the left and right expressions are stored, but, the binary operation was not since plus_op class does not have any state in this example. If it did, then the binary operation would have also been stored.

The actual construction of the parse tree is triggered from operator + code in the math_vector class since it returns a math_vector_expr. Notice that the math_vector_expr class doesn't ever compute anything –it merely stores its arguments and returns another math_vector_expr. Key here is to have all math_vector_expr instances rvalues or at least constant references to exploit compiler optimizations and avoid generating code with run-time overhead. Finally, notice that only by passing a math_vector_expr to the math_vector constructor or by assigning one to a math_veector will the math_vector_expr finally be asked to compute the values belonging to each vector element. When coupled with C++11 compiler optimizations the end result is the elision of all compiler temporary variables and total code efficiency. It is totally and very cool!

C++ Code

The following C++ code is the example code used to produce all output above. Be sure to compile the code with at least level one optimizations. This code will work with both clang++ v3.1 and g++ v4.8 (and likely v4.7). I compiled the code using these options: -std=c++11 -O3.

If you want to build the code so that it does not use expression templates, then #define DONT_USE_EXPR_TEMPL. If not, then ensure the macro is not defined.

#include <array>
#include <algorithm>
#include <initializer_list>
#include <type_traits>
#include <iostream>

//#define DONT_USE_EXPR_TEMPL

//=====================================================================

template <std::size_t N> class math_vector;

template <
  typename LeftExpr,
  typename BinaryOp,
  typename RightExpr
>
class math_vector_expr
{
  public:
    math_vector_expr() = delete;

    math_vector_expr(LeftExpr l, RightExpr r) :
      l_(std::forward<LeftExpr>(l)),
      r_(std::forward<RightExpr>(r))
    {
    }

    // Prohibit copying...
    math_vector_expr(math_vector_expr const&) = delete;
    math_vector_expr& operator =(math_vector_expr const&) = delete;

    // Allow moves...
    math_vector_expr(math_vector_expr&&) = default;
    math_vector_expr& operator =(math_vector_expr&&) = default;

    template <typename RE>
    auto operator +(RE&& re) const ->
      math_vector_expr<
        math_vector_expr<LeftExpr,BinaryOp,RightExpr> const&,
        BinaryOp,
        decltype(std::forward<RE>(re))
      >
    {
      return
        math_vector_expr<
          math_vector_expr<LeftExpr,BinaryOp,RightExpr> const&,
          BinaryOp,
          decltype(std::forward<RE>(re))
        >(*this, std::forward<RE>(re))
      ;
    }

    auto le() ->
      typename std::add_lvalue_reference<LeftExpr>::type
      { return l_; }

    auto le() const ->
      typename std::add_lvalue_reference<
        typename std::add_const<LeftExpr>::type
      >::type
      { return l_; }

    auto re() ->
      typename std::add_lvalue_reference<RightExpr>::type
      { return r_; }

    auto re() const ->
      typename std::add_lvalue_reference<
        typename std::add_const<RightExpr>::type
      >::type
      { return r_; }

    auto operator [](std::size_t index) const ->
      decltype(
        BinaryOp::apply(this->le()[index], this->re()[index])
      )
    {
      return BinaryOp::apply(le()[index], re()[index]);
    }

  private:
    LeftExpr l_;
    RightExpr r_;
};

//=====================================================================

template <typename T>
struct plus_op
{
  static T apply(T const& a, T const& b)
  {
    return a + b;
  }

  static T apply(T&& a, T const& b)
  {
    a += b;
    return std::move(a);
  }

  static T apply(T const& a, T&& b)
  {
    b += a;
    return std::move(b);
  }

  static T apply(T&& a, T&& b)
  {
    a += b;
    return std::move(a);
  }
};

//=====================================================================

template <std::size_t N>
class math_vector
{
  using impl_type = std::array<long double, N>;

  public:
    math_vector()
    {
      using namespace std;
      fill(begin(v_), end(v_), impl_type{});
      std::cout << this << ": math_vector()" << endl;
    }

    math_vector(math_vector const& mv) noexcept
    {
      using namespace std;
      copy(begin(mv.v_), end(mv.v_), begin(v_));
      std::cout << this << ": math_vector(copy: " << &mv << ")" << endl;
    }

    math_vector(math_vector&& mv) noexcept
    {
      using namespace std;
      move(begin(mv.v_), end(mv.v_), begin(v_));
      std::cout << this << ": math_vector(move: " << &mv << ")" << endl;
    }

    math_vector(std::initializer_list<typename impl_type::value_type> l)
    {
      using namespace std;
      copy(begin(l), end(l), begin(v_));
      std::cout << this << ": math_vector(initlist)" << endl;
    }

    math_vector& operator =(math_vector const& mv) noexcept
    {
      using namespace std;
      copy(begin(mv.v_), end(mv.v_), begin(v_));
      std::cout << this << ": math_vector op =(copy: " << &mv << ")" << endl;
      return *this;
    }

    math_vector& operator =(math_vector&& mv) noexcept
    {
      using namespace std;
      move(begin(mv.v_), end(mv.v_), begin(v_));
      std::cout << this << ": math_vector op =(move: " << &mv << ")" << endl;
      return *this;
    }

    ~math_vector()
    {
      using namespace std;
      std::cout << this << ": ~math_vector()" << endl;
    }

    void swap(math_vector& mv)
    {
      using namespace std;
      for (std::size_t i = 0; i<N; ++i)
        swap(v_[i], mv[i]);
    }

    auto operator [](std::size_t index) const
      -> typename impl_type::value_type const&
    {
      return v_[index];
    }

    auto operator [](std::size_t index)
      -> typename impl_type::value_type&
    {
      return v_[index];
    }

    math_vector& operator +=(math_vector const& b)
    {
      for (std::size_t i = 0; i<N; ++i)
        v_[i] += b[i];
      return *this;
    }

  #ifndef DONT_USE_EXPR_TEMPL

    template <typename LE, typename Op, typename RE>
    math_vector(math_vector_expr<LE,Op,RE>&& mve)
    {
      for (std::size_t i = 0; i < N; ++i)
        v_[i] = mve[i];
      std::cout << this << ": math_vector(expr: " << &mve << ")" << std::endl;
    }

    template <typename RightExpr>
    math_vector& operator =(RightExpr&& re)
    {
      for (std::size_t i = 0; i<N; ++i)
        v_[i] = re[i];
      return *this;
    }

    template <typename RightExpr>
    math_vector& operator +=(RightExpr&& re)
    {
      for (std::size_t i = 0; i<N; ++i)
        v_[i] += re[i];
      return *this;
    }

    template <typename RightExpr>
    auto operator +(RightExpr&& re) const ->
      math_vector_expr<
        math_vector const&,
        plus_op<typename impl_type::value_type>,
        decltype(std::forward<RightExpr>(re))
      >
    {
      return
        math_vector_expr<
          math_vector const&,
          plus_op<typename impl_type::value_type>,
          decltype(std::forward<RightExpr>(re))
        >(
          *this,
          std::forward<RightExpr>(re)
        )
      ;
    }
  #endif // #ifndef DONT_USE_EXPR_TEMPL

  private:
    impl_type v_;
};

//=====================================================================

template <std::size_t N>
inline void swap(math_vector<N>& a, math_vector<N>& b)
{
  a.swap(b);
}

//=====================================================================

#ifdef DONT_USE_EXPR_TEMPL

template <std::size_t N>
inline math_vector<N> operator +(
  math_vector<N> const& a,
  math_vector<N> const& b
)
{
  math_vector<N> retval(a);
  retval += b;
  return retval;
}

template <std::size_t N>
inline math_vector<N> operator +(
  math_vector<N>&& a,
  math_vector<N> const& b
)
{
  a += b;
  return std::move(a);
}

template <std::size_t N>
inline math_vector<N> operator +(
  math_vector<N> const& a,
  math_vector<N>&& b
)
{
  b += a;
  return std::move(b);
}

template <std::size_t N>
inline math_vector<N> operator +(
  math_vector<N>&& a,
  math_vector<N>&& b
)
{
  a += std::move(b);
  return std::move(a);
}

#endif // #ifdef DONT_USE_EXPR_TEMPL

//=====================================================================

template <std::size_t N>
std::ostream& operator <<(std::ostream& os, math_vector<N> const& mv)
{
  os << '(';
  for (std::size_t i = 0; i < N; ++i)
    os << mv[i] << ((i+1 != N) ? ',' : ')');
  return os;
}

//=====================================================================

int main()
{
  using namespace std;

  try
  {
    {
      cout << "CASE 1:\n";
      math_vector<3> a{1.0, 1.1, 1.2};
      math_vector<3> b{2.0, 2.1, 2.2};
      math_vector<3> c{3.0, 3.1, 3.2};
      math_vector<3> d{4.0, 4.1, 4.2};
      math_vector<3> result = a + b + c + d;
      cout << '[' << &result << "]: " << result << "\n";
    }
    cout << endl;
    {
      cout << "CASE 2:\n";
      math_vector<3> result =
        math_vector<3>{1.0, 1.1, 1.2} +
        math_vector<3>{2.0, 2.1, 2.2} +
        math_vector<3>{3.0, 3.1, 3.2} +
        math_vector<3>{4.0, 4.1, 4.2}
      ;
      cout << '[' << &result << "]: " << result << "\n";
    }
  }
  catch (...)
  {
    return 1;
  }
}

//=====================================================================

Leave a Reply

Your email address will not be published. Required fields are marked *