Evaluating using Docker Swarm

Introduction

This article describes how a network of computers can be used to perform GA evaluations. The example detailed here runs many instances of the GAF Evaluation Server, each in its own Docker container and all managed as a Docker Swarm service. The Swarm service provides a single endpoint, load balancing and service discovery. This allows GA evaluations to be carried out on a network of computers (a cluster) such as a Beowulf cluster, see https://en.wikipedia.org/wiki/Beowulf_cluster.

Requirements

This example requires the following:

  • A network of computers (cluster), each installed with Docker.
  • A development machine with Visual Studio or MonoDevelop that has network access to the cluster.
  • Both the GAF and the GAF.Network NuGet packages.

The cluster used in this example was a 28-core ARM-based cluster with each node running Arch Linux. Directories are therefore given as Linux paths. Please substitute the appropriate directory or folder names for the systems you are using.

GAF Evaluation Server

Details of the GAF Evaluation Server can be found here.

To keep things simple, a standard Mono/.Net Docker image from the Docker Hub (https://hub.docker.com) can be used. The GAF Evaluation executable and the fitness assembly (see below) are placed on the host machine and accessed by each Docker container within the Swarm service, so no custom image containing them has to be built. For the ARM-based cluster used here, the arm32v7/mono image was selected.

Fitness Function

For the GAF Evaluation Servers to perform an evaluation, the fitness function and any associated helper methods and objects need to exist on each server. To accomplish this, the Fitness Function is placed in a separate assembly and either copied manually to the server or sent to the server using automated deployment mechanisms such as Ansible (https://www.ansible.com). Details of how to create this assembly are shown in the article Implementing IRemoteFitness.

In this example, the folder /opt/gaf on each node was used to store the GAF Evaluation Server executable, the remote fitness assembly and any associated files.

When the Swarm service is created, the /opt/gaf folder on the host is bound to the same folder within the container. The container can therefore access the above files as if they were inside the container. See below for further details.

Creating a Docker Swarm Service
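
A swarm service can only be created once the nodes have been joined into a swarm. The sketch below shows one way this might be scripted from the development machine. The host names, the passwordless SSH access and the function name bootstrap_swarm are assumptions made for illustration and are not part of the original setup.

```shell
#!/bin/sh
# Hypothetical host names; replace with the nodes of your own cluster.
MANAGER="node01"
WORKERS="node02 node03 node04"

# Initialise the swarm on the manager and join each worker to it.
# $1 is the address the manager advertises to the other nodes.
bootstrap_swarm() {
    manager_addr="$1"

    # Create the swarm; this node becomes the first manager.
    ssh "$MANAGER" docker swarm init --advertise-addr "$manager_addr"

    # -q prints only the worker join token.
    token=$(ssh "$MANAGER" docker swarm join-token -q worker)

    # 2377 is the default swarm management port.
    for worker in $WORKERS; do
        ssh "$worker" docker swarm join --token "$token" "$manager_addr:2377"
    done
}

# Example (address is illustrative): bootstrap_swarm 192.168.1.10
```

Running docker node ls on the manager afterwards lists the nodes that have joined.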

Assuming that Docker is installed on each node of the cluster, and that the GAF Evaluation Server, the remote fitness assembly and any associated files are installed in the bound folder on each node as described above (/opt/gaf in this example), the service can be created with the following command.

   docker service create --replicas 16 --publish 11000:11000 --name gaf-eval-server --mount type=bind,src=/opt/gaf,dst=/opt/gaf johnnewcombe/mono-arm32v7 mono /opt/gaf/GAF.EvaluationServer.exe

This starts a Docker Swarm service that runs GAF.EvaluationServer.exe under Mono. The executable listens on port 11000 by default, so the service is configured with the --publish option to expose that port publicly. Sixteen instances (containers) of the service are run (--replicas 16), all accessed through the same IP address and port, with Docker handling the load balancing and service discovery. The service can be scaled as required; for example, the following command scales it up from 16 to 32 containers (tasks):

   docker service scale gaf-eval-server=32

Each container in the swarm has the internal folder /opt/gaf bound to the host directory of the same name. This allows tools such as Ansible (https://www.ansible.com/) to be used to deploy the application and fitness function to all nodes and therefore all containers within the swarm.
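
Where Ansible is not available, a plain shell loop is one possible alternative. The following is a minimal sketch only: the node names, the local staging directory ./gaf-deploy and the function name deploy_gaf_files are hypothetical, and it assumes rsync and SSH access are available on every node.

```shell
#!/bin/sh
# Hypothetical node names and local staging directory; adjust to suit.
NODES="node01 node02 node03"
SRC_DIR="./gaf-deploy"

# Copy the contents of SRC_DIR to /opt/gaf on every node.
# With DRY_RUN=1 the commands are printed instead of executed.
deploy_gaf_files() {
    for node in $NODES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "rsync -av $SRC_DIR/ $node:/opt/gaf/"
        else
            rsync -av "$SRC_DIR/" "$node:/opt/gaf/"
        fi
    done
}

# Example: DRY_RUN=1 deploy_gaf_files
```

Because /opt/gaf is bind-mounted into each container, files copied this way become visible to the containers on that node without rebuilding the image.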

Details of the service can be obtained with the following commands:

   docker service inspect --pretty gaf-eval-server
   docker service ls
   docker service ps gaf-eval-server
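
The output of docker service ls can also be used in a script to wait until every replica is running before a GA run is started. The helper below is a sketch under that assumption; the function name service_ready is hypothetical.

```shell
#!/bin/sh
# Succeeds when the named service reports all of its replicas as running.
service_ready() {
    # Replicas is reported as "running/desired", for example "16/16".
    replicas=$(docker service ls --filter "name=$1" --format '{{.Replicas}}' | head -n 1)
    running=${replicas%%/*}
    desired=${replicas##*/}
    [ -n "$replicas" ] && [ "$running" = "$desired" ]
}

# Example:
#   until service_ready gaf-eval-server; do sleep 2; done
```

Note that the name filter matches on a prefix, so only the first matching line is used.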

Using Docker Swarm to solve the Travelling Salesman Example

Once the swarm service is up and running, it is a simple job to modify the example shown in Solving the Travelling Salesman Problem to perform the evaluations using the GAF Evaluation Servers running as a Docker Swarm Service. All that is required is to use the GAF.Network namespace classes to wrap the GA.

The example below shows the modified Travelling Salesman example code. The code uses the additional NetworkWrapper class provided by the GAF.Network NuGet package, which forwards evaluations to the specified endpoint representing the GAF Evaluation Server Swarm service. To test the service, simply run this on the development machine as normal.
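
Before running the client, it can be useful to confirm that the published port is reachable from the development machine. The following sketch assumes a netcat (nc) that supports the -z and -w options; the function name check_endpoint and the example address are hypothetical.

```shell
#!/bin/sh
# Report whether a TCP port on a host accepts connections.
# Assumes a netcat (nc) that supports -z (scan only) and -w (timeout in seconds).
check_endpoint() {
    host="$1"
    port="${2:-11000}"   # 11000 is the port published by the service above
    if nc -z -w 2 "$host" "$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
        return 1
    fi
}

# Example (address is illustrative): check_endpoint 192.168.1.10
```

A successful connection only shows that the swarm is accepting traffic on the port, not that the fitness assembly has been deployed correctly.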

The packages can be installed using the Package Manager console.

   PM> Install-Package GAF.Network

The code is very similar to the original code shown in the example Solving the Travelling Salesman Problem. The difference is that the network wrapper handles the network communication with the Swarm service endpoint.

The code...

   var networkWrapper = new NetworkWrapper(ga, endpoints, "Example.IRemoteFitness.dll", _concurrency);

creates the wrapper object and accepts, along with the GA and the name of the fitness assembly, a collection of endpoints and the concurrency. In this example the list of endpoints contains only the endpoint of the Docker Swarm service. However, multiple endpoints could be specified, giving access to multiple services or simply to a collection of machines running the GAF Evaluation Server.

The concurrency determines how many simultaneous requests the client makes to the service for each population evaluation. Typically, if there are 32 server instances all listening, the concurrency could be set to 32 or less.

In this example the OnEvaluationComplete event of the network wrapper class is used simply to show how many servers were used to evaluate the population.

This code is available via BitBucket.

   using System;
   using System.Collections.Generic;
   using System.Linq;
   using GAF.Extensions;
   using GAF.Operators;
   using GAF;
   using System.Diagnostics;
   using GAF.Network;
   using Example.IRemoteFitness;
   using System.Net;
   
   namespace Example.DistributedEvaluation
   {
       public class Program
       {
           private static Stopwatch _stopWatch;
           private const int _runCount = 1;
           private const int _concurrency = 24;
           private const int _populationSize = 100;
           private static HashSet<IPAddress> _serversInUse = new HashSet<IPAddress>();
           private static object _syncLock = new object();
   
           private static void Main(string[] args)
           {
               //get our cities
               var cities = CreateCities().ToList();
   
               //Each city is an object. The chromosome is a special case as it needs
               //to contain each city only once. Therefore, our chromosome will contain
               //all the cities with no duplicates
   
               //we can create an empty population as we will be creating the
               //initial solutions manually.
               var population = new Population(false, false);
   
               //create the initial solutions (chromosomes)
               for (var p = 0; p < _populationSize; p++)
               {
   
                   var chromosome = new Chromosome();
                   foreach (var city in cities)
                   {
                       chromosome.Genes.Add(new Gene(city));
                   }
   
                   chromosome.Genes.ShuffleFast();
                   population.Solutions.Add(chromosome);
               }
   
               //create the elite operator
               var elite = new Elite(5);
   
               //create crossover operator
               var crossover = new Crossover(0.85) { CrossoverType = CrossoverType.DoublePointOrdered };
   
               //create the SwapMutate operator
               var mutate = new SwapMutate(0.02);
   
               //note that for network fitness evaluation we simply pass null instead of a fitness
               //function.
               var ga = new GeneticAlgorithm(population, null);
   
               //subscribe to the generation and run complete events
               ga.OnGenerationComplete += ga_OnGenerationComplete;
               ga.OnRunComplete += ga_OnRunComplete;
   
               //add the operators
               ga.Operators.Add(elite);
               ga.Operators.Add(crossover);
               ga.Operators.Add(mutate);
   
               /****************************************************************************************
                * Up until now the GA is configured as if it were a non-distributed example except,
                * the fitness function is not specified (see note above)
                *
                * The NetworkWrapper (below) adds the networking functionality.
                *
                ***************************************************************************************/
   
               // using Command Arguments to pass endpoint(s)
               var endpoints = CreateEndpoints(args.ToList());
   
               //create the network wrapper, the fitness assembly name is passed in in order that any known types can be
               //extracted for chromosome serialisation etc.
               var networkWrapper = new NetworkWrapper(ga, endpoints, "Fitness.dll", _concurrency);
               networkWrapper.OnEvaluationComplete += nw_OnEvaluationComplete;
   
               _stopWatch = new Stopwatch();
               _stopWatch.Start();
   
               //locally declared terminate function
               networkWrapper.GeneticAlgorithm.Run(TerminateAlgorithm);
               networkWrapper.Dispose();
   
               //if we get here the algorithm has ended or been terminated by a keyboard key
               _stopWatch.Stop();
   
               Console.ReadLine();
   
           }
   
           private static bool TerminateAlgorithm(Population population, int currentGeneration, long currentEvaluation)
           {
               //terminate with any key
               if (Console.KeyAvailable)
               {
                   return true;
               }
   
               return currentGeneration >= 350;
           }
   
           private static IEnumerable<City> CreateCities()
           {
               var cities = new List<City>();
               cities.Add(new City("Birmingham", 52.486125, -1.890507));
               cities.Add(new City("Bristol", 51.460852, -2.588139));
               cities.Add(new City("London", 51.512161, -0.116215));
               cities.Add(new City("Leeds", 53.803895, -1.549931));
               cities.Add(new City("Manchester", 53.478239, -2.258549));
               cities.Add(new City("Liverpool", 53.409532, -3.000126));
               cities.Add(new City("Hull", 53.751959, -0.335941));
               cities.Add(new City("Newcastle", 54.980766, -1.615849));
               cities.Add(new City("Carlisle", 54.892406, -2.923222));
               cities.Add(new City("Edinburgh", 55.958426, -3.186893));
               cities.Add(new City("Glasgow", 55.862982, -4.263554));
               cities.Add(new City("Cardiff", 51.488224, -3.186893));
               cities.Add(new City("Swansea", 51.624837, -3.94495));
               cities.Add(new City("Exeter", 50.726024, -3.543949));
               cities.Add(new City("Falmouth", 50.152266, -5.065556));
               cities.Add(new City("Canterbury", 51.289406, 1.075802));
               return cities;
           }
   
           private static double CalculateDistance(Chromosome chromosome)
           {
               var distanceToTravel = 0.0;
               City previousCity = null;
   
               //run through each city in the order specified in the chromosome
               foreach (var gene in chromosome.Genes)
               {
                   var currentCity = (City)gene.ObjectValue;
   
                   if (previousCity != null)
                   {
                       distanceToTravel += previousCity.GetDistanceFromPosition(currentCity.Latitude,
                                                                           currentCity.Longitude);
                   }
   
                   previousCity = currentCity;
               }
   
               //add distance back to the starting point
               var firstCity = (City)chromosome.Genes[0].ObjectValue;
               distanceToTravel += previousCity.GetDistanceFromPosition(firstCity.Latitude,
                                                       firstCity.Longitude);
   
               return distanceToTravel;
           }
   
           public static List<IPEndPoint> CreateEndpoints(List<string> endpointAddresses)
           {
               List<IPEndPoint> ipEndPoints = new List<IPEndPoint>();
               foreach (var ea in endpointAddresses)
               {
   
                   var ep = NetworkWrapper.CreateEndpoint(ea);
                   if (ep != null)
                   {
                       ipEndPoints.Add(ep);
                   }
   
               }
   
               return ipEndPoints;
           }
   
           private static void ga_OnRunComplete(object sender, GaEventArgs e)
           {
               var fittest = e.Population.GetTop(1)[0];
               foreach (var gene in fittest.Genes)
               {
                   Console.WriteLine(((City)gene.ObjectValue).Name);
               }
           }
   
           private static void ga_OnGenerationComplete(object sender, GaEventArgs e)
           {
               var fittest = e.Population.GetTop(1)[0];
   
               var distanceToTravel = CalculateDistance(fittest);
               //the format string is split with '+' as a regular C# string literal cannot span lines
               Console.WriteLine(String.Format("Generation: {0}, Evaluations: {1}, Fitness: {2}, " +
                                               "Distance: {3}, ElapsedTime: {4}ms, Servers Used: {5}",
                   e.Generation,
                   e.Evaluations,
                   fittest.Fitness,
                   distanceToTravel,
                   _stopWatch.ElapsedMilliseconds,
                   _serversInUse.Count())
               );
               _stopWatch.Restart();
               _serversInUse.Clear();
           }
   
           private static void nw_OnEvaluationComplete(object sender, GAF.Network.EvaluationEventArgs args)
           {
   
               if (args != null && args.IPAddress != null)
               {
                   lock (_syncLock)
                   {
                       _serversInUse.Add(args.IPAddress);
                   }
               }
           }
       }
   }