really slow ALLOCATE / MOVE

Discussion:

really slow ALLOCATE / MOVE_ALLOC

Thomas Breuel

2008-11-06 15:50:38 UTC

Anyway, my point is that if some other allocation has used the space behind
the object you want to resize, then realloc() has to do a new allocation and
copy the data over. Unless, as I mentioned previously, the allocation is big
enough that malloc has switched from (s)brk to mmap.

Good realloc implementations will avoid copies for all sizes larger
than a single page, not just very large arrays.

I think people simply don't appreciate how big the difference between
realloc and MOVE_ALLOC really is, and how inefficient dynamic storage
management in GFortran can be.

Here are two simple programs that extend arrays one integer at a time.
The C version scales linearly, the Fortran version is quadratic. For
1000000 elements, the C version takes 39 milliseconds, the Fortran
version takes 18 minutes. The C version scales up easily to 100000000
elements (1.2 seconds). The C version of realloc is so efficient
that, except for inner loops, you simply just call it whenever it
makes sense. In gfortran, calling ALLOCATE seems to be so costly that
one really has to think very carefully about where to use it and adopt
error prone and costly pre-allocation strategies to work around these
inefficiencies.

Of course, you can generate a list of the first 10000000 integers much
more easily. But analogous allocation patterns do exist in many
real-world compute-intensive programs. GFortran performs poorly on
these benchmarks means that it will perform poorly on those real-world
tasks as well. There are some workarounds (defining data structures
with exponential doubling), but they are more cumbersome and still
less efficient than just doing the simple and efficient reallocation
(and fixing whatever other overhead there may be).

Tom

PS: Note that I would have liked to write the realloc_fortran.f95
using "a = [a,i]", but that crashes in the version of 4.3 that I have.

$ cat realloc_c.c
#include <stdlib.h>
#include <stdio.h>
#include <malloc.h>

int main(int argc,char **argv) {
if(argc<2) {printf("bad args\n"); exit(1);}
int n = atoi(argv[1]);
int *data = malloc(sizeof data[0]);
for(int i=1;i<n;i++) {
data = realloc(data,(i+1)*sizeof data[0]);
if(!data) abort();
data[i] = i;
}
}
$ cat realloc_fortran.f95
program ra
integer, allocatable :: a(:),temp(:)
integer n,i
character*100 buf
call getarg(1,buf)
read(buf,*) n
allocate(a(1))
do i=2,n
allocate(temp(i))
temp(1:i-1) = a
call move_alloc(temp,a)
a(i) = i
end do
print *,sum(a)
end program
$ gcc --std=c99 -O3 realloc_c.c -o realloc_c
$ gfortran -O3 -pedantic --std=f95 realloc_fortran.f95 -o realloc_fortran
$ time realloc_c 1000000

real 0m0.039s
user 0m0.016s
sys 0m0.024s
$ time realloc_fortran 1000000
1784293662

real 18m20.474s
user 10m7.414s
sys 8m12.463s
$

If you don't like the above example, here's another one that simulates
reallocations in a long-running dynamic program (and if you still
think that's unrealistic, replace int *data with int *data[100]).
And, yes, this does access uninitialized memory.

$ cat rc.c
#include <stdlib.h>
#include <stdio.h>
#include <malloc.h>
#include <math.h>

int main(int argc,char **argv) {
int *data = malloc(sizeof data[0]);
int i,size;
double total;
for(i=0;i<100000;i++) {
size = exp(1.0+15*drand48());
data = realloc(data,size * sizeof data[0]);
data[lrand48()%size] = lrand48();
}
for(i=0;i<size;i++) total += data[i];
printf("%g\n",total);
}
$ cat rf.f95
program ra
integer, allocatable :: a(:),temp(:)
integer n,k,i
real rv(3)
allocate(a(1))
do i=1,100000
call random_number(rv)
n = floor(exp(1.0+15.0*rv(1)))
allocate(temp(n))
k = min(n,size(a))
temp(1:k) = a(1:k)
call move_alloc(temp,a)
a(1+floor(rv(2)*n)) = floor(rv(3)*n)
end do
print *,sum(a)
end program
$ gfortran -O3 rf.f95
$ time a.out
176436294

real 0m14.984s
user 0m14.233s
sys 0m0.752s
$ gcc -O3 rc.c -lm
$ time a.out

real 0m0.243s
user 0m0.028s
sys 0m0.216s

Steve Kargl

2008-11-06 18:54:28 UTC

Permalink

Post by Thomas Breuel

I think the people actively working on improving gfortran
recognize there may be suboptimal memory management. There
is, however, only so much that the 4 or 5 volunteers (with
work and family commitments) are capable of doing. These
individuals probably place more importance on fixing real
bugs (from the gfortran wiki):

* PRs where a valid Fortran program is not accepted (internal
compiler error, wrongly rejected), wrong code is produced or
where the compile time or memory is excessively high -- 51 bugs
(10 assigned)

* PRs where invalid code is accepted or gives an internal compiler
error 40 bugs (3 assigned)

Feel free to submit a patch to use realloc.

--
Steve

Thomas Breuel

2008-11-06 19:32:20 UTC

Permalink

Post by Steve Kargl

Post by Thomas Breuel
Good realloc implementations will avoid copies for all sizes larger
than a single page, not just very large arrays.
I think people simply don't appreciate how big the difference between
realloc and MOVE_ALLOC really is, and how inefficient dynamic storage
management in GFortran can be.

I think the people actively working on improving gfortran
recognize there may be suboptimal memory management.

No problem; I'm not complaining about your efforts. Tobias asked what
my priorities were. I explained them. Responses from you and others
suggested that you didn't understand what I was getting at, so I
explained it and gave some examples and measurements.

Using realloc instead of malloc is often quite easy, but you need to
be aware of its importance; if you view it as a dodgy C hack that
maybe gains people a few percentage performance, you aren't going to
bother using it even when it's easy to do.

Post by Steve Kargl
Feel free to submit a patch to use realloc.

There is no single patch; you simply need to be aware of it importance
and use it when it makes sense. One place it should be fairly easy to
use is in "a = [a,i]" constructs, and I hope you can add it there as
part of the general improvements to ALLOCATE.

I'll probably provide a REALLOC subroutine at some point that uses
realloc internally and has a portable generic implementation in terms
of standard primitives.

Tom

Paul Richard Thomas

2008-11-06 19:45:54 UTC

Permalink

Thomas,

Post by Thomas Breuel
I think people simply don't appreciate how big the difference between
realloc and MOVE_ALLOC really is, and how inefficient dynamic storage
management in GFortran can be.

I do not believe this to be a feature of gfortran alone. One finds in
the Fortran 95 standard:

6.3.1.1 Allocation of allocatable arrays
An allocatable array that has been allocated by an ALLOCATE statement
and has not been
subsequently deallocated (6.3.3) is currently allocated and is
definable. Allocating a currently
allocated allocatable array causes an error condition in the ALLOCATE
statement. At the
beginning of execution of a program, allocatable arrays have the
allocation status of not currently
allocated and are not definable. The ALLOCATED intrinsic function
(13.14.9) may be used to
determine whether an allocatable array is currently allocated.

Thus, to be standard compliant, a fortran compiler must take the
performance hit that you talk about. In general, calls to 'malloc' or
'new' are very expensive in time; hence the use of realloc in C.
Ultimately, of course, the programmer is obliged, if all else fails,
to use the same strategy as realloc; allocate oversize and only
de-allocate/allocate when needed.

Octave, with which I have some slight familiarity, performs less well
than the commercial product precisely because new/free are so
expensive in time.

This will become and issue when we implement automatic allocation for
allocatable arrays, a la f2003.

Paul

Janne Blomqvist

2008-11-06 21:54:11 UTC

Permalink

Post by Paul Richard Thomas
In general, calls to 'malloc' or
'new' are very expensive in time; hence the use of realloc in C.
Ultimately, of course, the programmer is obliged, if all else fails,
to use the same strategy as realloc; allocate oversize and only
de-allocate/allocate when needed.

But this is not what realloc does. realloc takes a pointer allocated via
malloc(), and resizes the area pointed to by that pointer. In some cases
it's possible to do that just by enlarging the area (which is fast), in
other cases it's necessary to malloc() a new area, copy the old data
over and free() the old area (which is slower, and essentially the same
as the Fortran ALLOCATE temp array, copy, MOVE_ALLOC).

Post by Paul Richard Thomas
Octave, with which I have some slight familiarity, performs less well
than the commercial product precisely because new/free are so
expensive in time.

AFAIK much of this is due to the value semantics of the matlab language;
the commercial runtime has rather sophisticated COW (copy-on-write)
analysis in order to reduce the amount of needless copying.

Post by Paul Richard Thomas
This will become and issue when we implement automatic allocation for
allocatable arrays, a la f2003.

As I've mentioned previously in this thread, it should be possible to
optimize

a = [a,x]

into a realloc() call + copy x.

--
Janne Blomqvist

Thomas Breuel

2008-11-06 22:55:54 UTC

Permalink

Post by Janne Blomqvist
But this is not what realloc does. realloc takes a pointer allocated via
malloc(), and resizes the area pointed to by that pointer. In some cases
it's possible to do that just by enlarging the area (which is fast), in
other cases it's necessary to malloc() a new area, copy the old data over
and free() the old area (which is slower, and essentially the same as the
Fortran ALLOCATE temp array, copy, MOVE_ALLOC).

Actually, on many platforms, no data gets copied. If you try to
realloc a region larger than about a page and there is no room to
expand the region, realloc asks the operating system to remap the
pages making up the region somewhere else where the region can be
expanded. It's this extra functionality that makes realloc so
powerful.

Of course, not all platforms have the necessary hardware support, but
not all platforms have hardware floating point either.

Post by Janne Blomqvist
As I've mentioned previously in this thread, it should be possible to
optimize
a = [a,x]
into a realloc() call + copy x.

Yes, this would be great.

The other case, explicit reallocation of arrays, can be handled with
library routines.

Tom

Paul Richard Thomas

2008-11-10 11:34:09 UTC

Permalink

Janne,

Post by Paul Richard Thomas
Octave, with which I have some slight familiarity, performs less well
than the commercial product precisely because new/free are so
expensive in time.

AFAIK much of this is due to the value semantics of the matlab language; the
commercial runtime has rather sophisticated COW (copy-on-write) analysis in
order to reduce the amount of needless copying.

That's right - the correct comparison is between Matlab5 and octave.
Subsequent versions of Matlab use the just-in-time compiler that did
remarkable things for its performance. Matlab5 used some form of
internal memory management to obtain a few tens faster execution of
what you might call "fortran77" style code than octave. Octave uses
new/free throughout and gets badly hit by it.

Cheers

Paul

Thomas Breuel

2008-11-10 14:05:01 UTC

Permalink

Thanks for the response; looks like we're on the same page.

Post by Paul Richard Thomas

Post by Thomas Breuel
Good implementations of malloc use VM hardware and page remapping,
rather than reserve extra space. Look at the mremap system call on
Linux.

I understand this but reducing the number of system calls by
allocating oversize will certainly result in a performance
improvement.

Yes, but there needs to be some minimal user control. When I work
with 3.5Gbyte arrays on a 4Gbyte machine, I need to be able to tell
the library not to pre-allocate 1.5x the array size.

Post by Paul Richard Thomas
From working with arrays classes in C++, a simple scheme that lets the

user control absolute reserved space and tell the array class what to
do on resizing seems to work well. Based on that experience, I think
it would be possible to do this transparently in Fortran through a
collection of subroutines that pass hints to the allocator on
implementations that understand it (see below).

Fortran implementations that don't understand reserved storage can
provide implementations of realloc in terms of standard, portable
primitives of realloc, and do nothing for the other calls. Fortran
implementations that can perform efficient array reallocation through
a C realloc call or through direct VM manipulation internally make the
right tradeoffs between reserve allocation and VM-based resizing when
necessary.

A good library implementation for gfortran could perhaps encourage
incorporation of these features into a future standard.

Note that full automation of these just isn't possible; the compiler
simply doesn''t have the information from a single run of the program
to make the right choices.

Conformal versions of these primitives are rarely needed in my
experience, and I think it would be bad to make good
reservation/reallocation an all-or-nothing proposition. That's why I
think it's good to have a non-conformal realloc and a separate
conformal_realloc. Doing a good job at the non-conformal version is
fairly easy, and implementors could just do a minimal job on the
conformal_realloc (it would still be better and more convenient than
the current solutions).

Post by Paul Richard Thomas
True but a naive application with realloc would be a good start:-)

Yes, my point exactly :-)

Tom

Here's the interface. Again, the two simplest implementations for this are:

(1) each subroutine does nothing

(2) most of the subroutines do nothing, but realloc calls C's realloc

The next one would be:

(3) implement a simple storage reservation scheme (with some care,
this can actually be done in C library code with almost no compiler
modifications)

This kind of API would let programmers express their intent and needs
portably and give them much better efficiency than currently possible
on platforms that make a minimal effort to implement (2) or (3).

call realloc(a,[10,2000000])

Reallocate the array a with the new bounds, keeping old content for
overlapping elements. The reallocation is performed as if the array
has been flattened into a rank 1 array, grown/shrunk to the desired
size, and then reshaped in place to the new size. Internally, this
may use reserved space, C's realloc, or even explicit VM page
manipulation. If there is reserved space, reallocations that use less
total space than the reserved space many not require any kind of new
storage allocation.

d = reservation(a)

Returns a hint about the current reservation for a. The result is
either the actual array dimensions, the exact reservation, a vector at
least as large in each dimension than the actual reservation (if the
system has reserved extra space), or a vector of huge values if the
system uses an efficient reallocation scheme that does not require
reservations. (The intent is that code can determine whether a
reallocation request can be carried out efficiently by checking
whether each dimension is no larger than the reservation.)

call reserve(a,[10,1000000])

Reserve storage for future growth but don't change the allocation. If
any of the dimensions is smaller than the current size, it is treated
as if it were given as the current size. This may be called on an
unallocated array and will then affect the next allocation. If it is
called on an allocated array, it may cause the array to be reallocated
with identical bounds and the new reservation immediately, or it may
take effect at the next explicit reallocation.

call growth(a,[1.0,1.5])

Whenever the array has grown beyond its reserve and the user attempts
reallocation to some size, allocate extra space as given by the growth
factors. In this case, "call realloc(a,[n,m])" or "allocate(a(n,m))"
would implicitly be followed by a "call
reserve(a,[floor(1.0*n),floor(1.5*m)]). The default growth for any
array is 1.0 in every dimension.

call tighten(a)

Shrink the reserve of the array to its actual allocation and
deallocate any storage freed in the process. This is always a
conformal operation.

call conformal_realloc(a,[50,50],offset=[10,10])

Perform a conformal reallocation of the array, mapping the array
elements onto the new array index-by-index. The contents of newly
allocated elements is undefined. A good implementation will attempt
to perform any necessary data movements in place if possible. Other
possibilities for conformal remapping include laying out the array to
avoid data movement on resize when the array has a multidimensional
reservation, and/or controlling the array-too-page mapping so that the
array can be extended without copying by rearranging pages in memory.

A good system for conformal reallocation is much harder to implement
and much less frequently needed than the non-conformal version;
implementors should get the non-conformal version out the door
quickly, since that has a real performance impact for a lot of code.

Thomas Breuel

2008-11-10 14:07:30 UTC

Permalink

Post by Thomas Breuel
(1) each subroutine does nothing

I meant... except for the realloc, conformal_realloc, and reservation
function, which would simply use the standard Fortran primitives to
make code work correctly on implementations that have no
realloc/reservation support.

Tom

Thomas Breuel

2008-11-06 22:40:20 UTC

Permalink

Post by Paul Richard Thomas
Thus, to be standard compliant, a fortran compiler must take the
performance hit that you talk about.

It's not that simple. The Fortran standard lacks an efficient
explicit reallocation primitive; that's a problem, but the explicit
case is pretty easy to address with a library routine.

What can't be addressed with a library routine is optimizing
temporaries, either removing them completely, or reallocating them
using realloc. That's where gfortran developers need to pay
attention. There are many situations that can be optimized; the
section of the Fortran standard you cite no more prohibits optimizing
temporary storage than other parts prohibit CSE.

Post by Paul Richard Thomas
Ultimately, of course, the programmer is obliged, if all else fails,
to use the same strategy as realloc; allocate oversize and only
de-allocate/allocate when needed.

Good implementations of malloc use VM hardware and page remapping,
rather than reserve extra space. Look at the mremap system call on
Linux.

Post by Paul Richard Thomas
This will become and issue when we implement automatic allocation for
allocatable arrays, a la f2003.

Well, I think it's important that whoever works on that have a good
understanding of dynamic memory management strategies, in addition to
what actual malloc implementations do. There seem to be a lot of
prejudices and misconceptions people have about memory management.

Efficient memory management is at least as hard as good numerics, and
just as important for high performance computing. The good news is
that the Fortran compiler can potentially do a lot more for the
programmer automatically than C compilers can.

Tom

Paul Richard Thomas

2008-11-10 09:02:36 UTC

Permalink

Tom,

Post by Thomas Breuel
What can't be addressed with a library routine is optimizing
temporaries, either removing them completely, or reallocating them
using realloc. That's where gfortran developers need to pay
attention. There are many situations that can be optimized; the
section of the Fortran standard you cite no more prohibits optimizing
temporary storage than other parts prohibit CSE.

I agree with you entirely about temporaries. gfortran's performance
will almost certainly improve markedly if temporaries are not freed
but are kept for reuse via realloc.. One of the reasons why gfortran
does as well as it does, in comparison with commercial products, is
that we have paid a lot of attention to reducing temporary usage, via
fairly rigorous dependency analysis. We could go further by deploying
loop reversal (patch available), loop order rearrangement(present in
gfortran but commented out because it doesn't work correctly yet) and
breaking up long loops but I rather think that memory management is
the next most effective step.

Post by Thomas Breuel

Post by Paul Richard Thomas
Ultimately, of course, the programmer is obliged, if all else fails,
to use the same strategy as realloc; allocate oversize and only
de-allocate/allocate when needed.

Good implementations of malloc use VM hardware and page remapping,
rather than reserve extra space. Look at the mremap system call on
Linux.

I understand this but reducing the number of system calls by
allocating oversize will certainly result in a performance
improvement.

Post by Thomas Breuel

Post by Paul Richard Thomas
This will become and issue when we implement automatic allocation for
allocatable arrays, a la f2003.

True but a naive application with realloc would be a good start:-)

Post by Thomas Breuel
Efficient memory management is at least as hard as good numerics, and
just as important for high performance computing. The good news is
that the Fortran compiler can potentially do a lot more for the
programmer automatically than C compilers can.

Agreed - Paul

--
The knack of flying is learning how to throw yourself at the ground and miss.
--Hitchhikers Guide to the Galaxy

Janne Blomqvist

2008-11-06 21:38:26 UTC

Permalink

Post by Thomas Breuel

Good realloc implementations will avoid copies for all sizes larger
than a single page, not just very large arrays.

I don't see how this is possible. Your benchmark below hits exactly the
point I mentioned above, i.e. there is never another allocation in the
way preventing realloc() from just enlarging the existing area.

In glibc malloc, the mmap threshold is nowadays dynamically between 128
kB and 64 MB, much larger than a single page. See
http://sourceware.org/ml/libc-alpha/2006-03/msg00033.html for
justification and mentions of workloads were always using mmap (which
would benefit realloc) is a bad choice.

It actually seems that glibc malloc is one of the few mallocs out there
that use mmap. Windows doesn't do it, Solaris doesn't, FreeBSD doesn't,
Google tcmalloc doesn't. As a result googling around you'll find plenty
of people complaining that realloc performance sucks on their system
compared to Linux. This doesn't necessarily mean that all other mallocs
are bad, but perhaps realloc performance is not that big deal for most
workloads.

Post by Thomas Breuel
I think people simply don't appreciate how big the difference between
realloc and MOVE_ALLOC really is, and how inefficient dynamic storage
management in GFortran can be.

I certainly appreciate it (and thanks for the benchmarks and the bug
report, btw). In fact, since probably the only use of MOVE_ALLOC is to
reallocate via an temporary array, I'd say putting it into the standard
was stupid and it would have been much better with, say, a REALLOCATE
statement that would essentially do the same as realloc. But anyway,
that's water under the bridge now.

Well, such a REALLOCATE statement would necessarily be more complicated
than realloc due to the need to handle multidimensional arrays, but
still it would be relatively easy to implement a fairly optimal one as a
realloc() wrapper. Whereas recognizing and optimizing the temporary
array + MOVE_ALLOC pattern is, I fear, far from trivial.

I think that optimizing the "a = [a,x]" thing, while not trivial, would
at least be feasible. And that would, to a large extent, be the same
thing as REALLOCATE. So perhaps things aren't as bleak as they seem.

--
Janne Blomqvist

Thomas Breuel

2008-11-08 03:46:48 UTC

Permalink

Actually, if you run strace on the binary, you'll see that it does
call mremap; you get the same behavior for many simultaneous
allocations.

In glibc malloc, the mmap threshold is nowadays dynamically between 128 kB and 64 MB,
much larger than a single page.

Yes, but that's mostly due to the history of that code. Even with
those historical limitations, Linux realloc is already orders of
magnitude faster than the fastest possible solution currently
available in gfortran, and the difference is only going to get larger
in the future.

As a result googling around you'll find plenty of people complaining that realloc performance sucks on their system compared to Linux.

Yes; I think Linux is going to drive other vendors to implement better
reallocs in their malloc implementations as well.

As a result googling around you'll find plenty of people complaining that realloc performance
sucks on their system compared to Linux. This doesn't necessarily mean that all other
mallocs are bad, but perhaps realloc performance is not that big deal for most workloads.

It's not a big deal for most other languages. In C and C++,
developers get full control over memory management, and they can
implement allocation strategies that work well without relying on
realloc, but Fortran doesn't permit that. Other language runtimes
don't use malloc/free at all and manipulate the VM hardware directly,
but runtimes like gfortran don't do that. Calling realloc in some
places in Fortran isn't as good as implementing a Fortran-specific
storage allocator would be, but it's better than the current

Well, such a REALLOCATE statement would necessarily be more complicated than realloc
due to the need to handle multidimensional arrays

Given Fortran's array facilities, a useful REALLOCATE statement would
limit how multidimensional arrays can be reallocated, but there's
nothing wrong with that.

So, taking that back to the beginning: I think supporting "a = [a,x]"
via realloc would be nice, and providing a simple REALLOCATE library
subroutine would also be good. The latter I can do myself; for the
former, I hope that the compiler writers will consider implementing
it.

Tom

Janne Blomqvist

2008-11-08 13:12:38 UTC

Permalink

Post by Thomas Breuel

Actually, if you run strace on the binary, you'll see that it does
call mremap; you get the same behavior for many simultaneous
allocations.

I didn't claim otherwise. Obviously as the size gets big enough that the
allocator switches to mmap, it will uses mremap after that. Running your
C benchmark with "ltrace -S" which traces both library calls and
syscalls, with glibc 2.5 I see the following behavior:

1. The first malloc causes two brk() syscalls that results in increasing
the data segment size by 0x21000 = 132 kB.

2. Until the size of the data reaches that 132 kB, realloc() doesn't
cause any further syscalls. I.e. it uses the data segment that the
allocator reserved from the OS in step 1. At this stage, if the program
would have other allocations that would use the data segment, it's
likely that realloc() would be forced to copy the data, and at some
point this would result in further brk syscalls to increase the data
segment size.

3. When the size reaches 132 kB, realloc calls the mmap2 syscall, which
allocates an area of about that size. Although the trace doesn't show
it, at this point realloc is obviously also forced to copy the data over
from the data segment into this mmap area.

4. For the rest of the program, mremap is called every 4 kB (one page)
to increase the size of the mmap:ed allocation.

As you see, the behavior is consistent with glibc malloc having a mmap
threshold starting at 128 kB. Now this is obviously specific to the
glibc allocator; as I mentioned previously most other allocators don't
seem to use mmap (and hence mremap) at all (though I was wrong about
FreeBSD, that does use mmap similar to the glibc allocator).

Post by Thomas Breuel

In glibc malloc, the mmap threshold is nowadays dynamically between 128 kB and 64 MB,
much larger than a single page.

Yes, but that's mostly due to the history of that code.

Huh? No, the mmap threshold is larg(ish) because mmap is a very slow
syscall. Most allocations are both small and short-lived, and lowering
the mmap threshold would increase memory management overhead for many
programs.

Post by Thomas Breuel
It's not a big deal for most other languages. In C and C++,
developers get full control over memory management, and they can
implement allocation strategies that work well without relying on
realloc, but Fortran doesn't permit that. Other language runtimes
don't use malloc/free at all and manipulate the VM hardware directly,
but runtimes like gfortran don't do that. Calling realloc in some
places in Fortran isn't as good as implementing a Fortran-specific
storage allocator would be, but it's better than the current

I don't think that a Fortran-specific allocator would help. As you
mention, in C/C++ when the developer knows the memory behavior of his
own program, he can use a custom allocator and get better performance
than the standard malloc/free. But the Fortran compiler doesn't know
what the developer wants (and analyzing the program so that it can e.g.
switch to a pool allocator in some part of the program is far beyond the
state of compiler technology), and hence it has to provide a general
purpose allocator that works well for most workloads, which is precisely
what malloc/free is.

As an aside, at least ifort and pathscale also use malloc/free for heap
management, so gfortran isn't alone.

Now, for some highly dynamical F2003 programming style with plenty of
mostly small allocations/reallocations/deallocations, perhaps a
compacting GC would offer better performance. But that would also
probably reduce performance for the vast majority of Fortran programs
that do relatively few and large allocations.

--
Janne Blomqvist

Thomas Breuel

2008-11-08 16:23:30 UTC

Permalink

Post by Janne Blomqvist
I didn't claim otherwise. Obviously as the size gets big enough that the
allocator switches to mmap, it will uses mremap after that.

Yes, you keep proving that the Linux realloc--although better than the
other malloc libraries out there--isn't very good; we agree on that.
On modern hardware, one could do a lot better. But even limited as it
is, the portable C code I gave runs orders of magnitude faster than
anything I can currently write in portable Fortran 2003.

I don't expect gfortran to do a really good job at array memory
management(*); I realize that the resources just aren't there to do
that. But I think at least it should aim to do as well as a simple,
portable C program calling standard C library functions in the dumbest
possible way, and I hope this discussion and the benchmarks I posted
will draw a bit of attention to this issue.

Tom

(*) Controlling the array-to-page mappings and layout, using sparse
address spaces, keeping runtime statistics, generational allocation,
preallocation, eliminating array temporaries, implementing
copy-on-write, etc. This isn't rocket science; people have plenty of
experience with these issues from work on array libraries in C++,
other high performance array languages, memory mapped array storage
libraries, etc.