Bug #3609

closed

mon: track down the Monitor's memory consumption sources

Added by Joao Eduardo Luis over 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Normal
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Left a couple of monitors running overnight with heap profiling active.

Don't have the logs, as I wasn't expecting it to work (yet again) and had simply forgotten to kill the cluster.

This however produced some neat heap profiles that were finally usable -- my guess is that disabling -O2 and letting the monitors run free during the night helped a lot.
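For reference, heap profiles of this kind come from the gperftools heap profiler that ships with tcmalloc; a minimal sketch of the profiler API involved is below (how ceph-mon actually wires this up -- command hooks, dump prefix -- is an assumption here, only the gperftools calls themselves are real):

// Minimal sketch of the gperftools heap-profiler API that produces
// profiles like ceph-mon.profile.heap. Link with -ltcmalloc.
#include <gperftools/heap-profiler.h>

int main() {
    // Start attributing allocations to their call sites; dump files are
    // written as <prefix>.NNNN.heap.
    HeapProfilerStart("ceph-mon.profile");

    // ... daemon work would happen here ...

    HeapProfilerDump("periodic");   // write a profile snapshot now
    HeapProfilerStop();
    return 0;
}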

Now, the good stuff.

mon.b

Total: 337.4 MB
   337.4 100.0% 100.0%    337.4 100.0% ceph::buffer::create_page_aligned

[snip]

mon.c

Total: 2269.0 MB
  2269.0 100.0% 100.0%   2269.0 100.0% ceph::buffer::create_page_aligned

Note that in this case mon.b is the leader. At least it was supposed to be, unless something went awfully wrong and quorum was broken. I have no idea if that indeed happened, as I just killed them when I logged back into the server and noticed they were up.

By the way, this is running v0.55 (690f8175606edf37a3177c27a3949c78fd37099f).

Full heap profile attached.


Files

ceph-mon.profile.heap (11 KB) ceph-mon.profile.heap Joao Eduardo Luis, 12/12/2012 09:13 AM
mon-a.r01.ps (58.1 KB) mon-a.r01.ps Joao Eduardo Luis, 12/16/2012 04:30 AM
mon-b.r01.ps (66.4 KB) mon-b.r01.ps Joao Eduardo Luis, 12/16/2012 04:30 AM
mon-c.r01.ps (66.3 KB) mon-c.r01.ps Joao Eduardo Luis, 12/16/2012 04:30 AM

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #3569: Monitor & OSD failures when an OSD clock is wrong (Can't reproduce, 12/04/2012)

Actions #1

Updated by Joao Eduardo Luis over 11 years ago

  • Description updated (diff)
Actions #2

Updated by Joao Eduardo Luis over 11 years ago

From what the profile indicates, it makes sense that mon.c has higher memory consumption. Apparently, most of the memory consumption comes from Monitor::forward_leader_request(), and mon.c, being a peon, would forward all the PaxosService messages to mon.b (mon.b was a peon as well for some time before I manually killed mon.a for some reason, hence its memory consumption as well).
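To make that suspicion concrete, here is a hypothetical sketch (not the actual Monitor::forward_leader_request() code) of how a peon that forwards client requests to the leader can grow without bound if the forwarded-request bookkeeping is never cleaned up:

#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Message {
    std::vector<uint8_t> payload;          // stands in for the encoded request
};

struct RoutedRequest {
    std::unique_ptr<Message> request;      // copy kept so it can be resent
    uint64_t tid;
};

class Peon {
    std::map<uint64_t, RoutedRequest> routed_requests;
    uint64_t next_tid = 1;

public:
    void forward_to_leader(std::unique_ptr<Message> m) {
        uint64_t tid = next_tid++;
        routed_requests.emplace(tid, RoutedRequest{std::move(m), tid});
        // ... send a copy to the leader ...
        // Suspected bug shape: if no code path ever erases the entry
        // (on reply, or when the request is dropped on an election),
        // every forwarded request stays pinned in memory.
    }

    void handle_reply(uint64_t tid) {
        routed_requests.erase(tid);        // the cleanup that must happen
    }
};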

Actions #3

Updated by Greg Farnum over 11 years ago

Are we failing to clean up forwarded messages? That would certainly be a pretty bad leak that we need to deal with as soon as possible!

Actions #4

Updated by Joao Eduardo Luis over 11 years ago

Attaching 3 heap profiles from the monitors.

The monitors were under load from the mon workload gen, as well as some osd and mds thrashing.

mon.a was eventually OOM-killed after reaching well over 6GB VIRT mem.

The source of this behavior appears to lie in the in-memory maps, which apparently can grow without bound. The workload gen does a lot of pg map updates, and the monitors keep them all. As of this moment, I have yet to assess whether this is an issue with how we keep the in-memory pgmap, or some odd behavior triggered by incorrect behavior in the workload generator.
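As a rough illustration of that suspicion (a hypothetical sketch with invented names, not the PGMonitor code), the growth pattern would look like a versioned map that keeps every incremental update it has ever applied and never trims:

#include <cstdint>
#include <map>
#include <vector>

using version_t = uint64_t;

struct Incremental {
    std::vector<uint8_t> encoded;                    // one encoded pgmap update
};

class VersionedMap {
    std::map<version_t, Incremental> incrementals;   // grows with every update
    version_t latest = 0;

public:
    void apply_update(Incremental inc) {
        incrementals.emplace(++latest, std::move(inc));
        // Missing: a call to trim_to(), e.g. dropping everything older
        // than the last full map persisted to disk, which would bound
        // memory use.
    }

    void trim_to(version_t floor) {
        incrementals.erase(incrementals.begin(),
                           incrementals.lower_bound(floor));
    }
};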

Actions #5

Updated by Sage Weil over 11 years ago

Which in-memory maps? Nothing should grow without bound, except perhaps some of the internal monitor messages...

Actions #6

Updated by Ian Colle over 11 years ago

  • Priority changed from High to Normal
Actions #7

Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from New to 4

While testing a bug fix, I noticed the leader would go on a memory consumption rampage right after being started (3.2GB VIRT, 2.8GB RES in the first 2 minutes or so).

Curious to see whether the monitor was suffering from the same problem as an MDS that showed up on #ceph a few days ago, I recompiled the monitor with tcmalloc and restarted it -- the same thing happened within the first few minutes, but this time I had stats:

ubuntu@mira050:~/joao/ceph/src$ ./ceph -m 127.0.0.1:6789 heap stats
mon.atcmalloc heap stats:------------------------------------------------
MALLOC:       19091088 (   18.2 MB) Bytes in use by application
MALLOC: +   3236233216 ( 3086.3 MB) Bytes in page heap freelist
MALLOC: +      3361192 (    3.2 MB) Bytes in central cache freelist
MALLOC: +        88064 (    0.1 MB) Bytes in transfer cache freelist
MALLOC: +      1630152 (    1.6 MB) Bytes in thread cache freelists
MALLOC: +     17793024 (   17.0 MB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3278196736 ( 3126.3 MB) Actual memory used (physical + swap)
MALLOC: +      1847296 (    1.8 MB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   3280044032 ( 3128.1 MB) Virtual address space used
MALLOC:
MALLOC:           5188              Spans in use
MALLOC:             17              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

And so I tried to release the memory (following what Greg suggested on IRC during said MDS memory rampage as well):

ubuntu@mira050:~/joao/ceph/src$ ./ceph -m 127.0.0.1:6789 heap release
mon.a releasing free RAM back to system.
ubuntu@mira050:~/joao/ceph/src$ ./ceph -m 127.0.0.1:6789 heap stats
mon.atcmalloc heap stats:------------------------------------------------
MALLOC:       16099256 (   15.4 MB) Bytes in use by application
MALLOC: +       167936 (    0.2 MB) Bytes in page heap freelist
MALLOC: +      3432272 (    3.3 MB) Bytes in central cache freelist
MALLOC: +       126464 (    0.1 MB) Bytes in transfer cache freelist
MALLOC: +      1624824 (    1.5 MB) Bytes in thread cache freelists
MALLOC: +     17793024 (   17.0 MB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     39243776 (   37.4 MB) Actual memory used (physical + swap)
MALLOC: +   3240800256 ( 3090.7 MB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   3280044032 ( 3128.1 MB) Virtual address space used
MALLOC:
MALLOC:           5216              Spans in use
MALLOC:             13              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------

RES dropped to a little over 15MB. Obtaining the stats some time later shows that RES is increasing, but still under 30MB.

VIRT, however, is still over 3.1GB, but of that, 3GB is marked as 'released to OS (aka unmapped)'.
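For context, 'heap release' boils down to asking tcmalloc to madvise() its page-heap freelist back to the kernel; the underlying gperftools call is the following (a minimal sketch of the library call only, not the monitor's admin-command plumbing):

#include <gperftools/malloc_extension.h>

void release_free_memory() {
    // Returns page-heap freelist pages to the OS via madvise(); they stay
    // mapped in the process's virtual address space (so VIRT stays high)
    // but resident memory drops.
    MallocExtension::instance()->ReleaseFreeMemory();
}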

Actions #8

Updated by Joao Eduardo Luis about 11 years ago

Starting the monitors with tcmalloc, I noticed the following (both are peons):

=== mon.b === 
Starting Ceph mon.b on mira050...
mkdir: cannot create directory `/var/run/ceph': Permission denied
bash: line 0: ulimit: open files: cannot modify limit: Operation not permitted
starting mon.b rank 1 at 127.0.0.1:6790/0 mon_data dev/mon.b fsid 1a5b586b-2d95-4753-8498-fe003f695587
tcmalloc: large alloc 2013265920 bytes == 0xcce40000 @  0x7fd62bccba45 0x7fd62b833a89 0x7fd62b8347fb 0x7fd62b8348e0 0x7fd62b834c10 0x57c0ef 0x48ea81 0x505d82 0x493cdf 0x4abc8a 0x48bb59 0x7fd62aee576d 0x48e07d
=== mon.c === 
Starting Ceph mon.c on mira050...
mkdir: cannot create directory `/var/run/ceph': Permission denied
bash: line 0: ulimit: open files: cannot modify limit: Operation not permitted
starting mon.c rank 2 at 127.0.0.1:6791/0 mon_data dev/mon.c fsid 1a5b586b-2d95-4753-8498-fe003f695587
tcmalloc: large alloc 2013265920 bytes == 0xcc984000 @  0x7ff89db23a45 0x7ff89d68ba89 0x7ff89d68c7fb 0x7ff89d68c8e0 0x7ff89d68cc10 0x57c0ef 0x48ea81 0x505d82 0x493cdf 0x4abc8a 0x48bb59 0x7ff89cd3d76d 0x48e07d

That's 2013265920 bytes, i.e. 1920 MiB -- roughly 2GB in a single allocation.

Actions #9

Updated by Joao Eduardo Luis about 11 years ago

btw, and as such:

ubuntu@mira050:~/joao/ceph/src$ ./ceph -m 127.0.0.1:6790 heap stats
mon.btcmalloc heap stats:------------------------------------------------
MALLOC:       15300128 (   14.6 MB) Bytes in use by application
MALLOC: +   5669994496 ( 5407.3 MB) Bytes in page heap freelist
MALLOC: +      4094376 (    3.9 MB) Bytes in central cache freelist
MALLOC: +        83456 (    0.1 MB) Bytes in transfer cache freelist
MALLOC: +      3439160 (    3.3 MB) Bytes in thread cache freelists
MALLOC: +     29786112 (   28.4 MB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   5722697728 ( 5457.6 MB) Actual memory used (physical + swap)
MALLOC: +     10424320 (    9.9 MB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   5733122048 ( 5467.5 MB) Virtual address space used
MALLOC:
MALLOC:           3753              Spans in use
MALLOC:             13              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
ubuntu@mira050:~/joao/ceph/src$ ./ceph -m 127.0.0.1:6790 heap release
mon.b releasing free RAM back to system.
ubuntu@mira050:~/joao/ceph/src$ ./ceph -m 127.0.0.1:6790 heap stats
mon.btcmalloc heap stats:------------------------------------------------
MALLOC:       15305488 (   14.6 MB) Bytes in use by application
MALLOC: +        28672 (    0.0 MB) Bytes in page heap freelist
MALLOC: +      4106008 (    3.9 MB) Bytes in central cache freelist
MALLOC: +        75776 (    0.1 MB) Bytes in transfer cache freelist
MALLOC: +      3442136 (    3.3 MB) Bytes in thread cache freelists
MALLOC: +     29786112 (   28.4 MB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     52744192 (   50.3 MB) Actual memory used (physical + swap)
MALLOC: +   5680377856 ( 5417.2 MB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   5733122048 ( 5467.5 MB) Virtual address space used
MALLOC:
MALLOC:           3748              Spans in use
MALLOC:             13              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Actions #10

Updated by Joao Eduardo Luis about 11 years ago

It appears that, when starting a monitor, we will go through all the pg and osdmap incremental versions and apply them.

tcmalloc: large alloc 1207959552 bytes == 0x4a4e4000 @  0x7f134fccaa45 0x7f134f832a89 0x7f134f8337fb 0x7f134f8338e0 0x7f134f833c10 0x6e0162 0x6d86f5 0x6db2b5 0x6db636 0x57b9c4 0x4a3b07 0x48aac1 0x7f134eee476d 0x48e07d

[snip]

2013-04-19 05:28:08.028707 7f1350e29780 10 mon.a@-1(probing).paxos(paxos recovering c 113441..113467) init last_pn: 128100 accepted_pn: 128001 last_committed: 113467 first_committed: 113441
2013-04-19 05:28:08.028739 7f1350e29780 10 mon.a@-1(probing).paxos(paxos recovering c 113441..113467) init
2013-04-19 05:28:08.028917 7f1350e29780  7 mon.a@-1(probing).pg v0 update_from_paxos loading latest full pgmap v25501
2013-04-19 05:28:08.029241 7f1350e29780  7 mon.a@-1(probing).pg v25501 update_from_paxos  applying incremental 25502

[snip]

2013-04-19 05:28:08.045042 7f1350e29780  7 mon.a@-1(probing).pg v25974 update_from_paxos  applying incremental 25975
2013-04-19 05:28:08.045048 7f1350e29780 10 mon.a@-1(probing).pg v25975 v25975: 24 pgs: 24 active+degraded; 9518 bytes data, 82028 MB used, 1630 GB / 1802 GB avail; 21/42 degraded (50.000%)
2013-04-19 05:28:08.045075 7f1350e29780 10 mon.a@-1(probing).pg v25975 send_pg_creates to 0 pgs
2013-04-19 05:28:08.045081 7f1350e29780 10 mon.a@-1(probing).pg v25975 update_logger

Some 400+ incremental versions were applied.

The same goes for the osdmap; the only difference is that we start with the latest full map at 1002 and apply incremental versions up to 25296 -- that's a lot of incrementals for this to be right.
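For illustration, the startup behavior described above amounts to something like the following (a hedged sketch of an update_from_paxos-style replay, with invented names -- not the actual PGMonitor/OSDMonitor code): load the latest full map, then decode and apply every incremental after it, which is only cheap if full maps are written and old incrementals trimmed reasonably often.

#include <cstdint>
#include <map>
#include <vector>

using version_t  = uint64_t;
using bufferlist = std::vector<uint8_t>;         // stand-in for ceph::bufferlist

struct Store {
    std::map<version_t, bufferlist> full_maps;      // full map snapshots
    std::map<version_t, bufferlist> incrementals;   // one entry per version
    version_t last_committed = 0;
};

struct InMemoryMap {
    version_t version = 0;
    void decode_full(const bufferlist&) { /* rebuild full state */ }
    void apply_incremental(const bufferlist&) { /* merge one update */ }
};

void update_from_paxos(InMemoryMap& m, const Store& s) {
    // Start from the newest full map at or below last_committed.
    auto it = s.full_maps.upper_bound(s.last_committed);
    if (it != s.full_maps.begin()) {
        --it;
        m.decode_full(it->second);
        m.version = it->first;
    }
    // Then replay every incremental up to last_committed. In the logs
    // above this loop covered 400+ pgmap versions and an osdmap span
    // from 1002 to 25296, decoding a buffer for each one.
    for (version_t v = m.version + 1; v <= s.last_committed; ++v) {
        auto inc = s.incrementals.find(v);
        if (inc == s.incrementals.end())
            break;                               // nothing more to apply
        m.apply_incremental(inc->second);
        m.version = v;
    }
}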

Actions #11

Updated by Sage Weil almost 11 years ago

  • Status changed from 4 to Resolved