Distributed monitoring and control of the nodes comprising Company’s product via a custom-built solution based on Apache ZooKeeper.

Some of the problems included:

  • Company’s product, a distributed system deployed on multiple nodes in a cluster, required monitoring and control of its nodes
  • The services comprising Company’s product were each deployed on a separate node of the cluster, so control operations such as start and stop, as well as status queries, required establishing multiple SSH connections
  • Some services comprising Company’s product often hung while starting or stopping, which prevented the start/stop sequence for all services from completing successfully

Some of the solutions applied included:

  • Implementing a Java agent, deployed on each node of the cluster comprising Company’s product, that can start, stop and query the status of the specific service running on that node, so that only one SSH connection to any one node of the cluster is needed
  • Implementing service wrappers for each type of service, which the agent application monitors and controls
  • Implementing start and stop order configuration for each service and executing the synchronized start/stop command sequence in the defined order
  • Utilizing a distributed lock (Apache Curator’s InterProcessSemaphoreMutex) to ensure no other application can initiate another start/stop sequence until the previous one completes
  • Automating the build of the agent application via a Hudson build server, so that it can be integrated into the existing deployment pipeline and its latest release distributed with new versions of Company’s product
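
The ordered, mutually exclusive start/stop sequence described above can be illustrated with a minimal sketch. It is not the project's actual code: the class StartStopSequence, the ServiceConfig fields and the runSequence helper are hypothetical names, and a local java.util.concurrent.Semaphore stands in for the cluster-wide InterProcessSemaphoreMutex so that the example runs without a ZooKeeper ensemble. Both are non-reentrant, but the stand-in only guards a single JVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

public class StartStopSequence {

    /** Hypothetical per-service configuration: a name plus its start and stop positions. */
    static final class ServiceConfig {
        final String name;
        final int startOrder;
        final int stopOrder;

        ServiceConfig(String name, int startOrder, int stopOrder) {
            this.name = name;
            this.startOrder = startOrder;
            this.stopOrder = stopOrder;
        }
    }

    // Assumption: local stand-in for the distributed lock. A single-permit
    // semaphore is non-reentrant, as is Curator's InterProcessSemaphoreMutex.
    private static final Semaphore SEQUENCE_LOCK = new Semaphore(1);

    /** Returns a copy of the services sorted by their configured start order. */
    static List<ServiceConfig> startSequence(List<ServiceConfig> services) {
        List<ServiceConfig> ordered = new ArrayList<>(services);
        ordered.sort(Comparator.comparingInt((ServiceConfig s) -> s.startOrder));
        return ordered;
    }

    /** Returns a copy of the services sorted by their configured stop order. */
    static List<ServiceConfig> stopSequence(List<ServiceConfig> services) {
        List<ServiceConfig> ordered = new ArrayList<>(services);
        ordered.sort(Comparator.comparingInt((ServiceConfig s) -> s.stopOrder));
        return ordered;
    }

    /**
     * Applies the command to each service in the given order while holding the lock.
     * Returns false without running anything if another sequence is in progress.
     */
    static boolean runSequence(List<ServiceConfig> ordered, Consumer<ServiceConfig> command) {
        if (!SEQUENCE_LOCK.tryAcquire()) {
            return false;
        }
        try {
            for (ServiceConfig service : ordered) {
                command.accept(service);
            }
            return true;
        } finally {
            // Always release, even if a command throws, so later sequences can run.
            SEQUENCE_LOCK.release();
        }
    }

    public static void main(String[] args) {
        List<ServiceConfig> services = Arrays.asList(
                new ServiceConfig("web", 3, 1),
                new ServiceConfig("database", 1, 3),
                new ServiceConfig("queue", 2, 2));

        boolean started = runSequence(startSequence(services),
                s -> System.out.println("start " + s.name));
        boolean stopped = runSequence(stopSequence(services),
                s -> System.out.println("stop " + s.name));
        System.out.println("sequences completed: " + (started && stopped));
    }
}
```

In a cluster-wide setup the tryAcquire and release calls would be replaced by acquiring and releasing the Curator lock, and the command passed to runSequence would delegate to the agent on each node instead of printing.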

Technology stack

  • Java
  • Groovy
  • Bash
  • Apache ZooKeeper
  • Apache Curator
  • Apache Commons Exec
  • Apache Commons CLI

Industry

IT