c++ - Why am I observing multiple inheritance to be faster than single?

Friday 20 October 2017


I have the following two files.

single.cpp:
    #include <iostream>

    using namespace std;

    unsigned long a = 0;

    class A {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
    };

    class B : public A {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
        void g() __attribute__ ((noinline)) { return; }
    };

    int main() {
        cin >> a;
        A* obj;
        if (a > 3)
            obj = new B();
        else
            obj = new A();

        unsigned long result = 0;

        for (int i = 0; i < 65535; i++) {
            for (int j = 0; j < 65535; j++) {
                result += obj->f();
            }
        }

        cout << result << endl;
    }

And multiple.cpp:

    #include <iostream>

    using namespace std;

    unsigned long a = 0;

    class A {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
    };

    class dummy {
    public:
        virtual void g() __attribute__ ((noinline)) { return; }
    };

    class B : public A, public dummy {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
        virtual void g() __attribute__ ((noinline)) { return; }
    };

    int main() {
        cin >> a;
        A* obj;
        if (a > 3)
            obj = new B();
        else
            obj = new A();

        unsigned long result = 0;

        for (int i = 0; i < 65535; i++) {
            for (int j = 0; j < 65535; j++) {
                result += obj->f();
            }
        }

        cout << result << endl;
    }

I am using gcc version 3.4.6 with the -O2 flag, and these are the timings I get:

multiple:

    real    0m8.635s
    user    0m8.608s
    sys     0m0.003s

single:

    real    0m10.072s
    user    0m10.045s
    sys     0m0.001s

On the other hand, if I invert the order of the base classes in multiple.cpp, like this:

    class B : public dummy, public A {

then I get the following timings, which are slightly slower than those for single inheritance, as one might expect given the 'thunk' adjustments to the this pointer that the code then has to make:

    real    0m11.516s
    user    0m11.479s
    sys     0m0.002s
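
To illustrate the this-pointer adjustment mentioned above, here is a minimal sketch, assuming a mainstream ABI such as the Itanium C++ ABI that GCC uses on Linux. The names AFirst, DummyFirst and offset_of_A are hypothetical and not part of the original files; the sketch only measures where the A subobject sits inside each derived class.

```cpp
#include <cassert>
#include <cstddef>

struct A     { virtual int  f() { return 1; } virtual ~A() {} };
struct Dummy { virtual void g() {}            virtual ~Dummy() {} };

// A listed first: on common ABIs the A subobject shares the object's address.
struct AFirst : A, Dummy {
    int f() override { return 2; }
    void g() override {}
};

// Dummy listed first: on common ABIs the A subobject is placed after Dummy's.
struct DummyFirst : Dummy, A {
    int f() override { return 3; }
    void g() override {}
};

// Byte distance from the start of a Derived object to its A subobject.
// The layout is ABI-specific; this only reports what the compiler chose.
template <class Derived>
std::ptrdiff_t offset_of_A() {
    Derived d;
    A* as_base = &d;  // derived-to-base conversion may adjust the address
    // Virtual dispatch through the A* still reaches the derived override.
    assert(as_base->f() != 1);
    return reinterpret_cast<const char*>(as_base) -
           reinterpret_cast<const char*>(&d);
}
```

On such an ABI, offset_of_A<AFirst>() is expected to be 0 and offset_of_A<DummyFirst>() nonzero. In the second layout, a call to f() through an A* has to go through a thunk that moves this back to the start of the object before running the override.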

Any idea why this may be happening? There doesn't seem to be any difference in the assembly generated for the loop in any of the three cases. Is there some other place that I need to look at?

Also, I have bound the process to a specific CPU core and I am running it at real-time priority with SCHED_RR.

EDIT: This was noticed by Mysticial and reproduced by me. Adding

    cout << "vtable: " << *(void**)obj << endl;

just before the loop in single.cpp makes single as fast as multiple, clocking in at 8.4 s, just like the public A, public dummy version.
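
As a rough illustration of what that line prints, the sketch below reads the first pointer-sized word of a polymorphic object. It assumes an ABI that stores the vtable pointer at offset 0 (true for GCC's Itanium C++ ABI, not guaranteed by the C++ standard); vtable_of is a hypothetical helper, and this A and B are stand-ins for the classes in single.cpp.

```cpp
#include <cassert>
#include <cstring>

struct A { virtual int f() { return 0; } virtual ~A() {} };
struct B : A { int f() override { return 1; } };

// Returns the first pointer-sized word of the object, which on the
// assumed ABI is the vtable pointer. memcpy avoids the aliasing issues
// of the *(void**)obj cast used in the question.
const void* vtable_of(const A& obj) {
    static_assert(sizeof(A) >= sizeof(void*), "object too small to hold a vptr");
    void* vptr = nullptr;
    std::memcpy(&vptr, &obj, sizeof vptr);
    assert(vptr != nullptr);
    return vptr;
}
```

Under that assumption, two A objects report the same address while an A and a B report different ones, so the printed value identifies the dynamic type chosen at run time.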

Answer

I think I have at least a further lead on why this may be happening. The assembly for the loops is exactly identical, but the object files are not!

For the loop preceded by the cout, i.e.

    cout << "vtable: " << *(void**)obj << endl;

    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }

I get the following in the object file:

    40092d: bb fe ff 00 00          mov    $0xfffe,%ebx
    400932: 48 8b 45 00             mov    0x0(%rbp),%rax
    400936: 48 89 ef                mov    %rbp,%rdi
    400939: ff 10                   callq  *(%rax)
    40093b: 48 98                   cltq
    40093d: 49 01 c4                add    %rax,%r12
    400940: ff cb                   dec    %ebx
    400942: 79 ee                   jns    400932
    400944: 41 ff c5                inc    %r13d
    400947: 41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
    40094e: 7e dd                   jle    40092d

However, without the cout, the loop becomes the following (.cpp first):

    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }

Now the object file:

    400a54: bb fe ff 00 00          mov    $0xfffe,%ebx
    400a59: 66                      data16
    400a5a: 66                      data16
    400a5b: 66                      data16
    400a5c: 90                      nop
    400a5d: 66                      data16
    400a5e: 66                      data16
    400a5f: 90                      nop
    400a60: 48 8b 45 00             mov    0x0(%rbp),%rax
    400a64: 48 89 ef                mov    %rbp,%rdi
    400a67: ff 10                   callq  *(%rax)
    400a69: 48 98                   cltq
    400a6b: 49 01 c4                add    %rax,%r12
    400a6e: ff cb                   dec    %ebx
    400a70: 79 ee                   jns    400a60
    400a72: 41 ff c5                inc    %r13d
    400a75: 41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
    400a7c: 7e d6                   jle    400a54

So I'd have to say it's not really due to the false aliasing that Mysticial points out, but simply due to these NOPs that the compiler/linker is emitting.

The assembly in both cases is:

    .L30:
        movl    $65534, %ebx
        .p2align 4,,7
    .L29:
        movq    (%rbp), %rax
        movq    %rbp, %rdi
        call    *(%rax)
        cltq
        addq    %rax, %r12
        decl    %ebx
        jns     .L29
        incl    %r13d
        cmpl    $65534, %r13d
        jle     .L30

Now, .p2align 4,,7 inserts padding (data prefixes/NOPs) until the address of the next instruction has its low four bits equal to 0, but only if that takes at most 7 bytes of padding. In the case without the cout, the address of the instruction just after the .p2align, before padding, would be

    0x400a59 (low 12 bits: 0b101001011001)

Since aligning the next instruction takes 7 bytes of padding, which is within the limit, the padding is emitted in the object file.

On the other hand, in the case with the cout, the instruction just after the .p2align ends up at

    0x400932 (low 12 bits: 0b100100110010)

and it would take 14 bytes, more than 7, to pad it to a 16-byte boundary. Hence no padding is emitted.
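
The arithmetic above can be checked with a small sketch. The function p2align_padding is a hypothetical helper, not part of binutils; it only mirrors the documented rule that `.p2align power,,max_skip` pads to the next multiple of 2^power unless that would need more than max_skip bytes.

```cpp
#include <cassert>
#include <cstdint>

// Number of padding bytes ".p2align power,,max_skip" would emit at
// `address`: enough to reach the next multiple of 2^power, or none at
// all if reaching it needs more than max_skip bytes.
std::uint64_t p2align_padding(std::uint64_t address, unsigned power,
                              std::uint64_t max_skip) {
    assert(power < 64);
    const std::uint64_t alignment = std::uint64_t{1} << power;
    const std::uint64_t mask = alignment - 1;
    // Bytes to the next boundary; 0 when the address is already aligned.
    const std::uint64_t needed = (alignment - (address & mask)) & mask;
    return needed <= max_skip ? needed : 0;
}
```

With the two addresses from the object files, p2align_padding(0x400a59, 4, 7) gives 7 (the seven bytes from 400a59 to 400a5f in the listing), while p2align_padding(0x400932, 4, 7) gives 0 because 14 bytes would be needed.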

So the extra time taken is simply due to the NOPs that the compiler pads the code with (for better cache alignment) when compiling with the -O2 flag, and not really due to false aliasing.

I think this resolves the issue. I am using http://sourceware.org/binutils/docs/as/P2align.html as my reference for what .p2align actually does.
