Friday 20 October 2017

c++ - Why am I observing multiple inheritance to be faster than single?


I have the following two files.

single.cpp:



    #include <iostream>

    using namespace std;

    unsigned long a = 0;

    class A {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
    };

    class B : public A {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
        void g() __attribute__ ((noinline)) { return; }
    };

    int main() {
        cin >> a;
        A* obj;
        if (a > 3)
            obj = new B();
        else
            obj = new A();

        unsigned long result = 0;

        for (int i = 0; i < 65535; i++) {
            for (int j = 0; j < 65535; j++) {
                result += obj->f();
            }
        }

        cout << result << endl;
    }


And multiple.cpp:




    #include <iostream>

    using namespace std;

    unsigned long a = 0;

    class A {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
    };

    class dummy {
    public:
        virtual void g() __attribute__ ((noinline)) { return; }
    };

    class B : public A, public dummy {
    public:
        virtual int f() __attribute__ ((noinline)) { return a; }
        virtual void g() __attribute__ ((noinline)) { return; }
    };

    int main() {
        cin >> a;
        A* obj;
        if (a > 3)
            obj = new B();
        else
            obj = new A();

        unsigned long result = 0;

        for (int i = 0; i < 65535; i++) {
            for (int j = 0; j < 65535; j++) {
                result += obj->f();
            }
        }

        cout << result << endl;
    }


I am using gcc version 3.4.6 with the -O2 flag.

These are the timing results I get.



multiple:

    real    0m8.635s
    user    0m8.608s
    sys     0m0.003s

single:

    real    0m10.072s
    user    0m10.045s
    sys     0m0.001s



On the other hand, if I invert the order of the base classes in multiple.cpp:

    class B : public dummy, public A {

then I get the following timings, which are slightly slower than for single inheritance, as one might expect given the 'thunk' adjustments to the this pointer that the code then needs:



    real    0m11.516s
    user    0m11.479s
    sys     0m0.002s
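
To make the 'thunk' remark concrete, here is a minimal sketch (not part of the original programs) that measures where the A subobject sits inside a derived object for each base-class order. The names Dummy, AFirst, DummyFirst and offset_of_A are hypothetical, and the exact offsets are an ABI detail: the comments assume a typical layout such as the one GCC uses on x86-64.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>

struct A     { virtual int  f() { return 1; } virtual ~A() {} };
struct Dummy { virtual void g() {}            virtual ~Dummy() {} };

// A is the first base: on typical ABIs the A subobject sits at offset 0.
struct AFirst : A, Dummy {
    int  f() override { return 2; }
    void g() override {}
};

// Dummy is the first base: the A subobject is laid out after it.
struct DummyFirst : Dummy, A {
    int  f() override { return 3; }
    void g() override {}
};

// Byte distance from the start of the complete object to its A subobject.
template <class Derived>
std::ptrdiff_t offset_of_A(Derived& d) {
    A* as_base = &d;  // implicit upcast; the compiler applies any offset here
    return reinterpret_cast<const char*>(as_base) -
           reinterpret_cast<const char*>(&d);
}
```

With these definitions, offset_of_A on an AFirst object is expected to be 0, while on a DummyFirst object it is expected to be non-zero. In the second case a call to f() through an A* has to move the this pointer back to the start of the complete object before running the override, and that adjustment is what the thunk performs.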


Any idea why this may be happening? As far as the loop is concerned, the generated assembly appears to be identical in all three cases. Is there somewhere else I need to look?

I have also bound the process to a specific CPU core, and I am running it at real-time priority with SCHED_RR.



EDIT: This was noticed by Mysticial and reproduced by me. Adding

    cout << "vtable: " << *(void**)obj << endl;

just before the loop in single.cpp makes single as fast as multiple: it clocks in at 8.4 s, just like the public A, public dummy version.
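
For readers unsure what *(void**)obj reads: on common ABIs the first pointer-sized word of a polymorphic object is its vtable pointer, so the expression prints the address of the vtable for the object's dynamic type. The sketch below is only an illustration under that assumption (the C++ standard does not guarantee this layout), and first_word is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>

struct A { virtual int f() { return 0; } virtual ~A() {} };
struct B : A { int f() override { return 1; } };

// Copies the first pointer-sized word out of a polymorphic object.
// Assumption: the ABI stores the vtable pointer there (true for GCC on x86-64).
const void* first_word(const A* obj) {
    assert(obj != nullptr);
    const void* word = nullptr;
    std::memcpy(&word, static_cast<const void*>(obj), sizeof word);
    return word;
}
```

Under that assumption, two A objects yield the same value and a B object yields a different one, which is why the printed value identifies the class chosen at run time.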


Answer



I think I have at least a further lead on why this may be happening. The assembly for the loops is exactly identical, but the object files are not!

With the cout placed before the loop, i.e.

    cout << "vtable: " << *(void**)obj << endl;

    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }


I get the following in the object file:

    40092d: bb fe ff 00 00          mov    $0xfffe,%ebx
    400932: 48 8b 45 00             mov    0x0(%rbp),%rax
    400936: 48 89 ef                mov    %rbp,%rdi
    400939: ff 10                   callq  *(%rax)
    40093b: 48 98                   cltq
    40093d: 49 01 c4                add    %rax,%r12
    400940: ff cb                   dec    %ebx
    400942: 79 ee                   jns    400932
    400944: 41 ff c5                inc    %r13d
    400947: 41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
    40094e: 7e dd                   jle    40092d



However, without the cout, the loop in the .cpp file is:

    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }


The corresponding object file is:

    400a54: bb fe ff 00 00          mov    $0xfffe,%ebx
    400a59: 66                      data16
    400a5a: 66                      data16
    400a5b: 66                      data16
    400a5c: 90                      nop
    400a5d: 66                      data16
    400a5e: 66                      data16
    400a5f: 90                      nop
    400a60: 48 8b 45 00             mov    0x0(%rbp),%rax
    400a64: 48 89 ef                mov    %rbp,%rdi
    400a67: ff 10                   callq  *(%rax)
    400a69: 48 98                   cltq
    400a6b: 49 01 c4                add    %rax,%r12
    400a6e: ff cb                   dec    %ebx
    400a70: 79 ee                   jns    400a60
    400a72: 41 ff c5                inc    %r13d
    400a75: 41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
    400a7c: 7e d6                   jle    400a54



So I'd have to say it is not really due to false aliasing, as Mysticial suggested, but simply due to these NOPs that the compiler/linker is emitting.



The assembly in both cases is:

    .L30:
        movl    $65534, %ebx
        .p2align 4,,7
    .L29:
        movq    (%rbp), %rax
        movq    %rbp, %rdi
        call    *(%rax)
        cltq
        addq    %rax, %r12
        decl    %ebx
        jns     .L29
        incl    %r13d
        cmpl    $65534, %r13d
        jle     .L30


Now, .p2align 4,,7 inserts padding (NOPs) until the address of the next instruction has its last four bits equal to 0, but only if that takes at most 7 bytes of padding. In the case without the cout, the address of the instruction just after the .p2align, before padding, would be

    0x400a59 (low 12 bits: 0b101001011001)

Since aligning the next instruction takes <= 7 bytes of padding (exactly 7, up to 0x400a60), the assembler does in fact insert them in the object file.

On the other hand, in the case with the cout, the instruction just after the .p2align ends up at

    0x400932 (low 12 bits: 0b100100110010)

and it would take more than 7 bytes (14, in fact) to pad it to a 16-byte boundary. Hence, no padding is inserted.
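
The arithmetic above can be checked with a small sketch. The function name p2align_padding is hypothetical, and it only models the rule as described in the binutils documentation for .p2align (align to 2^pow2 bytes, but skip the alignment entirely if it would need more than max_skip bytes). It is not the assembler's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Number of padding bytes that ".p2align pow2,,max_skip" would insert
// when the next instruction would otherwise start at `addr`.
std::uint64_t p2align_padding(std::uint64_t addr, unsigned pow2,
                              std::uint64_t max_skip) {
    assert(pow2 < 64);
    const std::uint64_t alignment = std::uint64_t{1} << pow2;
    // Distance to the next multiple of `alignment` (0 if already aligned).
    const std::uint64_t needed =
        (alignment - (addr & (alignment - 1))) & (alignment - 1);
    // If more than max_skip bytes are needed, no padding is emitted at all.
    return needed <= max_skip ? needed : 0;
}
```

For the two addresses in this answer, p2align_padding(0x400a59, 4, 7) gives 7 (the seven bytes of data16/nop seen in the listing), while p2align_padding(0x400932, 4, 7) gives 0, because 14 bytes would be needed and that exceeds the limit of 7.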



So the extra time taken is simply due to the NOPs that the code is padded with (for better cache alignment) when compiling with the -O2 flag, and not really due to false aliasing.

I think this resolves the issue. I am using http://sourceware.org/binutils/docs/as/P2align.html as my reference for what .p2align actually does.


