Staged service restart with Ansible
I’ve been working on a small project to create a Cassandra cluster for development purposes. I’m using Vagrant and Ansible to deploy a 5-node Cassandra cluster, and node #5 would always fail to join the cluster.
I checked /var/log/cassandra/cassandra.log and this is what I found:
INFO [InternalResponseStage:1] 2017-09-09 18:49:07,673 ColumnFamilyStore.java:406 - Initializing system_auth.roles
INFO [main] 2017-09-09 18:49:08,666 StorageService.java:1439 - JOINING: waiting for schema information to complete
ERROR [main] 2017-09-09 18:49:09,687 MigrationManager.java:172 - Migration task failed to complete
ERROR [main] 2017-09-09 18:49:10,688 MigrationManager.java:172 - Migration task failed to complete
INFO [main] 2017-09-09 18:49:12,952 StorageService.java:1439 - JOINING: schema complete, ready to bootstrap
INFO [main] 2017-09-09 18:49:12,952 StorageService.java:1439 - JOINING: waiting for pending range calculation
INFO [main] 2017-09-09 18:49:12,952 StorageService.java:1439 - JOINING: calculation complete, ready to bootstrap
Exception (java.lang.UnsupportedOperationException) encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:902)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:681)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:393)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:600)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:689)
ERROR [main] 2017-09-09 18:49:12,960 CassandraDaemon.java:706 - Exception encountered during startup
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:902) ~[apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:681) ~[apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612) ~[apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:393) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:600) [apache-cassandra-3.11.0.jar:3.11.0]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:689) [apache-cassandra-3.11.0.jar:3.11.0]
INFO [StorageServiceShutdownHook] 2017-09-09 18:49:12,988 HintsService.java:220 - Paused hints dispatch
WARN [StorageServiceShutdownHook] 2017-09-09 18:49:12,989 Gossiper.java:1538 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO [StorageServiceShutdownHook] 2017-09-09 18:49:12,989 MessagingService.java:984 - Waiting for messaging service to quiesce
INFO [ACCEPT-/192.168.44.105] 2017-09-09 18:49:13,002 MessagingService.java:1338 - MessagingService has terminated the accept() thread
INFO [StorageServiceShutdownHook] 2017-09-09 18:49:13,360 HintsService.java:220 - Paused hints dispatch
The section of interest is:
Exception (java.lang.UnsupportedOperationException) encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
When I started the service manually, it joined the cluster with no issues. There was clearly a timing issue preventing the final node from joining the Cassandra ring: as the exception says, node #5 was trying to bootstrap while other nodes were still bootstrapping. I thought the solution might lie in Ansible’s serial keyword, but that applies only at the play level, not the task level, so it didn’t give me the level of control I wanted.
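For comparison, here is roughly what the play-level form looks like. This is only a sketch: the group name cassandra and the task shown are placeholders, not taken from my actual playbook. With serial: 1, the whole play runs to completion on one host before Ansible moves on to the next, so the setting can't be limited to just the service start.

```yaml
# Sketch only: "cassandra" is a hypothetical inventory group.
- hosts: cassandra
  become: true
  serial: 1          # batches the entire play, one host at a time
  tasks:
    - name: Start Cassandra
      service:
        name: cassandra
        state: started
```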
I found some discussion of the issue on the Ansible GitHub and adapted a workaround that adds a sleep between each Cassandra service start.
It makes clever use of delegate_to to execute a sleep and a service restart on each host in turn. This staged execution of the Cassandra service start allowed all nodes to join the ring successfully.
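A minimal sketch of this kind of workaround might look like the following. The group name cassandra, the 60-second delay, and the exact restart command are assumptions for illustration; adjust them to your own setup. The task runs once, loops over the hosts in the play, and delegates each iteration to the next host, so the restarts happen one after another rather than in parallel.

```yaml
# Sketch only: group name, delay, and restart command are assumed values.
- hosts: cassandra
  become: true
  tasks:
    - name: Restart Cassandra on one node at a time, sleeping before each
      shell: "sleep 60 && service cassandra restart"
      delegate_to: "{{ item }}"                # run this iteration on the looped host
      with_items: "{{ ansible_play_hosts }}"   # every host in the current play
      run_once: true                           # loop once, not once per host
```

Because loop iterations execute sequentially, each node gets the delay before its restart, which gives the previous node time to finish joining the ring.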