I've seen this issue happen when bootstrapping new nodes into a DataStax Enterprise Cassandra cluster (version 2.0.10.71).
When a new node is started and bootstrapped, the bootstrap process starts streaming data from the other nodes in the cluster. After a short period of time (usually a minute or less), other nodes in the cluster show high ParNew GC pause times, and then those nodes drop off the cluster, failing the stream session.
INFO [main] 2015-04-27 16:59:58,644 StreamResultFuture.java (line 91) [Stream #d42dfef0-ecfe-11e4-8099-5be75b0950b8] beginning stream session /10.1.214.186
INFO [GossipTasks:1] 2015-04-27 17:01:06,342 Gossiper.java (line 890) InetAddress /10.1.214.186 down
INFO [HANDSHAKE-/10.1.214.186] 2015-04-27 17:01:21,400 OutboundTcpConnection.java (line 386) handshaking version /10.1.214.186
INFO [RequestResponseStage:11] 2015-04-27 17:01:23,439 Gossiper.java (line 876) InetAddress /10.1.214.186 up
Then, on the other node:
10.1.214.186 ERROR [STREAM-IN-/10.1.212.233] 2015-04-27 17:02:07,007 StreamSession.java (line 454) [Stream #d42dfef0-ecfe-11e4-8099-5be75b0950b8] streaming error occurred
I also see things like this in the logs:
10.1.219.232 INFO [ScheduledTasks:1] 2015-04-27 18:20:19,987 GCInspector.java (line 116) GC ParNew: 118272 ms 2 collections, 980357368 used; max 12801015808
10.1.221.146 INFO [ScheduledTasks:1] 2015-04-27 18:20:29,468 GCInspector.java (line 116) GC ParNew: 154911 ms 1 collections, 1287263224 used; max 12801015808
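To see how often pauses like these occur, the GCInspector lines can be filtered by the reported pause length. This is only a rough sketch: the function name `long_parnew_pauses` is made up for illustration, and it assumes the pause is the number immediately before the token `ms`, as in the lines above.

```shell
# Hypothetical helper: read log lines on stdin and print the GCInspector
# ParNew lines whose reported pause (in ms) exceeds the given threshold.
# Assumes the pause value is the field directly before the token "ms".
long_parnew_pauses() {
    threshold_ms="$1"
    awk -v t="$threshold_ms" '
        tolower($0) ~ /gcinspector/ && tolower($0) ~ /parnew/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "ms" && $(i - 1) + 0 > t) {
                    print
                    break
                }
            }
        }'
}
```

For example, `long_parnew_pauses 1000 < system.log` would list every ParNew pause longer than one second, where `system.log` stands in for wherever the node writes its log.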
It seems to happen on different nodes each time I try to bootstrap a new node.
I've found a related ticket: https://issues.apache.org/jira/browse/cassandra-6653
My guess is that when the new node comes up, a lot of compactions are firing off, and that might be what is causing the GC pause times. I had considered setting concurrent_compactors to half the total number of CPUs.
Does anyone have an idea?
Edit: more details around the GC settings. We are using i2.2xlarge nodes on EC2:
MAX_HEAP_SIZE="12g"
HEAP_NEWSIZE="800m"
Also:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
After working with the DSE crew, the following settings helped us.
On an i2.2xlarge node (8 CPUs, 60 GB of RAM, local SSD only):
increasing the heap new size to 512 MB * number of CPUs (in our case 4 GB)
setting memtable_flush_writers = 8
setting concurrent_compactors = total CPUs / 2 (in our case 4)
After making these changes we no longer see ParNew GC times exceeding 1 second on bootstrap (previously we were seeing 50-100 second GC times). FWIW, we don't see long ParNew GC times during normal operation, only during bootstrap.
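The sizing rules above can be worked through for any CPU count with a small sketch like the one below. The function name `suggest_settings` is hypothetical, and memtable_flush_writers is left at the fixed value 8 from this post, since nothing here says whether it should scale with the CPU count.

```shell
# Hypothetical helper: print the settings described above for a given CPU count.
# HEAP_NEWSIZE belongs in cassandra-env.sh; the other two are cassandra.yaml keys.
suggest_settings() {
    cpus="$1"
    # Heap new size: 512 MB per CPU.
    echo "HEAP_NEWSIZE=\"$((cpus * 512))M\""
    # Fixed at 8, the value used in this post (not derived from the CPU count).
    echo "memtable_flush_writers: 8"
    # Half of the total CPUs.
    echo "concurrent_compactors: $((cpus / 2))"
}
```

Calling `suggest_settings 8` reproduces the values we used on the 8-CPU i2.2xlarge, with the new-generation size expressed as 4096M instead of 4G. Treat the output as a starting point to test on your own hardware, not as a recommendation.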