Friday, 6 February 2015

SolrCloud With Zookeeper Ensemble on AWS

1.0   Background

Solr is a faceted, text-based search service backed by Apache Lucene. Data is indexed so your UI can display results for a search query very quickly. You may already be running Solr as a standalone application on a single instance which also hosts the Lucene indexes. The nature of the indexes means that the application is difficult to cluster, so your Solr service is vulnerable. If the service were lost, recovery would take time to re-index, rendering your application unusable for that period. The Solr 4.x releases (4.10.3 is used throughout this post) include SolrCloud, which allows the Lucene index data to be shared across multiple Solr instances fronted by a load balancer. This provides both resilience and performance enhancements to the search service. Should one instance of the service be lost, the service as a whole will continue to function, and the service can be scaled to meet demand when required.

This post outlines the complete installation and architecture of SolrCloud running on AWS.

2.0   Install & Set Up


2.1   Build the Basic Application Server Image

The first step is to build a server image which we will use as a template for launching our SolrCloud instances as nodes in the cluster. As we will see in 2.2, our SolrCloud instances will be launched into a private subnet within a VPC. However, to start, we will launch an EC2-Classic instance and use it as a temporary machine on which to build the configuration and create an image as a template. So, take the first step and launch an instance, with appropriate security, into EC2 from a basic Amazon Linux AMI.

Now we have a running instance we can configure the server, install Tomcat, Solr and Zookeeper, and create an image which we will use as a template for the nodes in our cluster. Everyone has their own flavour of where and how to install these things; you could also visit the AWS Marketplace and launch an instance from a pre-configured image.

2.1.1 JDK 1.7
  • Download the JDK, I used jdk-7u71-linux-x64.gz
  • Unzip into /opt/
  • Add a java.sh file into /etc/profile.d/ which sets the JAVA_HOME environment variable and adds the bin/ directory to the path
  • Reboot and check that the new JDK bin directory is on the path and the JAVA_HOME environment variable is correctly set.
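A minimal /etc/profile.d/java.sh along these lines does the job (the path assumes the archive unpacked to /opt/jdk1.7.0_71, the same location used in the configure step later):
export JAVA_HOME=/opt/jdk1.7.0_71
export PATH=$JAVA_HOME/bin:$PATH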

2.1.2 Tomcat 7
  • Download tomcat 7, I used apache-tomcat-7.0.57.tar.gz
  • Unzip into /usr/share/tomcat7
  • Set the URIEncoding attribute to UTF-8 on the Connector element in conf/server.xml
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" URIEncoding="UTF-8"/>
  • Set the manager-gui role, user and password in conf/tomcat-users.xml
<role rolename="manager-gui"/>
<user username="username" password="password" roles="manager-gui"/>
  • Install apr-devel & httpd-devel using the package manager
yum install -y apr-devel httpd-devel
  • Add a tomcat user with /usr/share/tomcat7 as the home directory, and remember to chown that directory to the new user.
useradd -d /usr/share/tomcat7 -s /bin/bash tomcat
chown -R tomcat:tomcat /usr/share/tomcat7
  • Build the Commons Daemon JSVC script to run the tomcat process. This will allow us to run tomcat on boot up and is the standard way to run tomcat as a daemon process. Follow this guide but ensure you configure the server with the following additional steps when running configure.
  • When running configure use the installed JDK and APR and copy the compiled APR libraries into the runtime environment.
./configure --with-apr=/usr --with-java-home=/opt/jdk1.7.0_71
cp /usr/local/apr/lib/libtcnative-1.s* /opt/jdk1.7.0_71/jre/lib/amd64/server
  • Copy the daemon.sh file, which runs the jsvc script you have built and copied into the bin/ directory, into /etc/init.d/tomcat7 and set the run level using chkconfig so it starts on boot.
# chkconfig: - 20 80
# description: starts tomcat - /usr/share/tomcat7 under user tomcat
  • Now set the environment variables for Tomcat. There are various options for doing this, but I just add another file under /etc/profile.d containing the following.
export CATALINA_HOME=/usr/share/tomcat7
export JSVC_OPTS="-Xms2048m -Xmx2048m"
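At this point it is worth a quick sanity check that the daemon starts and answers on port 8080 (a rough check, assuming the init script and chkconfig entry above are in place):
sudo /etc/init.d/tomcat7 start
curl -I http://localhost:8080/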
You will now have a Tomcat server platform on which to add the SolrCloud application. You might want to create an Amazon Machine Image of the instance you have configured which you can use as a base template for a generic tomcat server.

2.1.3 Deploy Solr 4.10.3

Solr is simply a web application which runs in tomcat. The Solr distribution comes with a pre-built .war file which you can deploy into tomcat in the usual way. I always create a deployer context file and set any environment variable, such as the Solr home directory, in that context. This also allows us to deploy different versions of the application into the same location within tomcat when we want to upgrade.

  • Download the Solr package, I used solr-4.10.3.tgz
  • Unpack the file into your home directory; we don’t need to unpack into any particular location as we are going to cherry-pick files within the distribution to deploy into tomcat. Find the solr.war file in:
solr-4.10.3/example/webapps/solr.war

  • Create a Solr Home directory somewhere outside of the tomcat application server. This will host the cores, including indexes and other core related configuration. I created my Solr Home in:
/var/lib/solr/
  • Ensure that the tomcat user has access to solr home
chown -R tomcat:tomcat /var/lib/solr
  • Now we can create the core. To do this I simply copied the contents of the example solr folder from the distribution into my Solr home directory (re-run the chown afterwards so tomcat still owns everything). This includes the example schemas, configuration and indexes but we can overwrite and get rid of them later.
cp -r solr-4.10.3/example/solr/* /var/lib/solr/
  • Create a deployer context for the solr.war file by placing the following file in:
/usr/share/tomcat7/conf/Catalina/localhost/solr.xml
<?xml version="1.0" encoding="UTF-8"?>
<Context path="/solr"
      docBase="/usr/share/tomcat7/deploy/solr.war">
        <Environment name="solr/home" override="true" type="java.lang.String" value="/var/lib/solr" />
</Context>
As you can see, I create a deploy directory within tomcat home where I place my web application resource files. 
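For completeness, here is a sketch of creating that deploy directory and dropping the war into it; the source path assumes the distribution was unpacked into the ec2-user home directory, as in the zkcli examples later:
mkdir -p /usr/share/tomcat7/deploy
cp /home/ec2-user/solr-4.10.3/example/webapps/solr.war /usr/share/tomcat7/deploy/
chown -R tomcat:tomcat /usr/share/tomcat7/deploy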
  • Now edit the solr.xml configuration in /var/lib/solr/solr.xml. This is the main Solr configuration file and defines our cloud formation. We will revisit this file a number of times. For now we only have one machine running, so we will set it up just for localhost. The solr.xml I have is pretty much the same as the example with the addition of the zkHost configuration element; here’s what the solrcloud stanza looks like in my solr.xml:
<solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">8080</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
    <str name="zkHost">127.0.0.1:2181</str>
</solrcloud>
  • Now you can start tomcat, which will deploy the solr.war web application into the webapps/ directory. You might want to check the catalina.out log and ensure the application has deployed; if not, add a logging file as specified below and raise the log level to DEBUG. Once successfully deployed and running you can shut tomcat down again.
  • Use the log4j.properties included in the resources/ directory in the Solr example. Copy the file into the WEB-INF/classes/ directory of the deployed solr web app within tomcat. You may want to change the logging properties and specify a solr.log file somewhere appropriate, e.g. /var/log/solr/solr.log
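A minimal log4j.properties along these lines (a sketch only; tune the pattern and sizes to taste) sends Solr’s logging to /var/log/solr/solr.log (remember to create that directory and make it writable by the tomcat user):
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/solr/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n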
  • In the core within the solr home directory you’ll find a core.properties file which contains a single entry specifying the name of the collection. This should reflect the name of your collection, the same name as the directory in which the core configuration and indexes are kept. In our case this is collection1 and the value in core.properties should reflect this.
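For the default example core that means core.properties contains just the single line:
name=collection1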
  • Restart tomcat and confirm the Solr application has booted up cleanly. You can now open up a browser, log in to your Solr admin page and view the core.
Browse to http://<your-ec2-public-dns>:8080/solr and you’ll see something like this

Single instance of Solr


2.1.4 Deploy Zookeeper 3.4.6

Now we have the application running it is time to turn our attention to the underlying data itself and how it will be kept in sync across the SolrCloud cluster. SolrCloud uses Zookeeper to keep the config and index data synchronised across the instances. In order to ensure we don’t have a single point of failure within the system we will cluster Zookeeper as well; this is known as an ensemble. In fact, the best option is to run Zookeeper on each instance that is also running Solr within tomcat. This way we need only a single image which we can use as a template for all instances within the cluster. Zookeeper needs a majority of instances to be up in order to maintain the service; therefore an odd number of instances should be deployed. For this reason we will have 3 instances in the cluster, each running Solr and Zookeeper.
  • Download Zookeeper (zookeeper-3.4.6.tar.gz) and unpack it into /usr/share/zookeeper-3.4.6
  • The main zookeeper configuration file is zoo.cfg in the conf/ directory. There are various configurations here which we can tweak later; for now we will keep the defaults, but we need to specify the zookeeper data directory. You should create this directory outside of the zookeeper application home; my dataDir configuration looks like this
dataDir=/var/lib/zookeeper/
I then create this directory
mkdir /var/lib/zookeeper

  • Whilst we are editing zoo.cfg note the clientPort value. This is the port Solr, the client, will use to communicate with zookeeper. By default this is set to port 2181 and we will need to ensure we specify this in our AWS security group later.
  • We need to ensure each zookeeper instance in the ensemble knows how to communicate with the others. Although we are currently installing one instance and know nothing about the other instances we will run in our cluster, we can prepare the configuration. Again we do this in zoo.cfg and add the following lines:

# zookeeper ensemble
server.1=localhost:2191:3888
#server.2=x.x.x.x:2191:3888
#server.3=x.x.x.x:2191:3888
I have left the configuration for servers 2 & 3 commented out; we will need to specify these later on when we bring up our instances within the AWS VPC environment.
Again, note the ports. 2191 and 3888 are used within the zookeeper ensemble and are different from the client port value. We will need to ensure these are specified in the AWS security group later on.

  • Each zookeeper instance in the ensemble needs to know which server it is. In the configuration we specified in zoo.cfg we have said that this instance, localhost, is server id 1. Zookeeper will look for a file called myid in the data directory. Within that file is a number which details which server this is. I created the file with a single value of 1; we will do this later on our other instances with values 2 & 3.

echo "1" > /var/lib/zookeeper/myid

  • Whilst in the conf/ directory edit the log4j.properties file or create one if not already present. I set a rolling log in /var/log/zookeeper/zookeeper.out
  • We now need to make a zookeeper start script allowing the server to be run as a daemon process. I hacked the zkServer.sh script in the bin/ directory of the distribution and set the run levels as we did with tomcat and placed it in /etc/init.d. 
# chkconfig: - 20 80
# description: starts zookeeper - /usr/share/zookeeper-3.4.6 under user root
I then added the script and turned it on using the chkconfig command. 
chkconfig --add zookeeper
chkconfig zookeeper on 
  • This is a good point to test the zookeeper installation. We can boot up the server using the new start script in /etc/init.d/zookeeper or just run the zkServer.sh script in bin/ using the start command.
bin/zkServer.sh start
  • You should now be able to connect to the zookeeper server using the zkCli.sh script in the bin directory like so:
bin/zkCli.sh
This will connect and allow you to run zookeeper commands.
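You can also check the server’s health without the interactive shell, for example (assuming nc is installed; the ruok four-letter command should answer imok):
bin/zkServer.sh status
echo ruok | nc localhost 2181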

  • Now the server is running, exit from the zkCli command line; we can then upload the Solr configuration and core so that Zookeeper can do its thing and synchronise the indexes across multiple instances. We are still running just one instance, so we’ll have to revisit this when we come to run it across multiple instances in the VPC. We do this step here just to allow us to boot up Solr and ensure it runs cleanly with Zookeeper. We need to execute 3 commands using the zkcli.sh tool which comes with the Solr distribution we downloaded (note this is a different script from Zookeeper’s own zkCli.sh).

/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh
We will use this tool to upload the solr configuration files found in solr home, link the uploaded collection to the configuration and bootstrap zookeeper. Execute the following commands: 
/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd upconfig -d /var/lib/solr/collection1/conf/ -n collection1_conf
/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd linkconfig -collection collection1 -confname collection1_conf -solrhome /var/lib/solr/
/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd bootstrap -solrhome /var/lib/solr/

  • We can now restart tomcat and run Solr as a single-node cluster to test everything is correct. Start the server with the tomcat start script and check the solr and zookeeper logs to ensure everything is configured correctly. Once you’re happy, shut it down again.


2.1.5 Lucene Indexes and core definition

We can now concern ourselves with the actual data we are going to index: the schema and configuration of the data we are serving up. If you are migrating from an old single-instance Solr you will probably want to copy these files across into the conf/ directory of the core in Solr home. Depending upon which version of Solr you are migrating from, some changes to these files may be needed. If you are using the default example core, named collection1, then you won’t need to do this, but I am assuming you will want to progress to using a schema of your own at some point.
  • Copy the following files into the conf/ directory of the core within the solr home.
schema.xml
synonyms.xml
solrconfig.xml
stopwords.xml
spellings.xml
  • If you are upgrading from an older version of Solr prior to version 4 then you will need to add the following line to the schema.xml. 
<field name="_version_" type="long" indexed="true" stored="true"/>

  • If you wish to test Solr and view the details in the Solr admin page then you will need to reload the data into zookeeper. Use the zkcli.sh tool to issue the upconfig, linkconfig and bootstrap commands outlined in the last step of the previous section. You can then reboot tomcat and browse the Solr admin page on your running instance.

You should now image your instance. We will use this as a template from which we will launch instances within the VPC and create the cluster. Once you have created your AMI you can terminate the instance on which you created the template.

2.2      Build the AWS VPC Environment 

Now we have an image of our server which can be used as a template, we need to define an environment in which to launch instances to form the cluster. To maintain security we will create our SolrCloud cluster within a private subnet in a Virtual Private Cloud (VPC). We do not want any direct access to Solr from the outside world; all search services are fronted by a web application. Controllers in that application marshal requests to the Solr service and return results in the response. Therefore, we only need access to the Solr service from another instance. In my case this instance resides in the EC2-Classic environment, so it is an easy job to link it to the VPC using EC2 ClassicLink. The following steps will guide you through the process of building this VPC environment using the AWS console.

  • Browse to the VPC dashboard and create a new VPC. Give it a name, something like SolrCloud-VPC. Set a CIDR block, which defines the range of IP addresses available when launching instances. We will be implementing a private subnet so the instances will not be reachable from the outside world. I set my CIDR block to 10.0.0.0/16
  • Create a new subnet and associate it with the new VPC. Give the subnet a CIDR block that is a subset of the VPC’s; I gave it 10.0.0.0/24, giving me 256 possible private IP addresses (less the handful AWS reserves in each subnet).
  • Create a Routing Table, name it SolrCloud-RT or something like that, and associate it with your VPC.
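If you would rather script this part than click through the console, the equivalent AWS CLI calls look roughly like this (vpc-xxxxxxxx is a placeholder for the id returned by the first call):
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24
aws ec2 create-route-table --vpc-id vpc-xxxxxxxx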
  • Now create a security group which will be associated with each instance we launch into our VPC. From the configuration in the previous sections we know which ports are needed for communication between the instances in the cluster and machines outside the VPC. We said that we would use EC2 ClassicLink from one EC2-Classic instance to the VPC; this will facilitate SSH access to our instances. When you create a security group it is given a predefined id prefixed with the letters ‘sg’. When you specify a security group id in the inbound rules, it means any instance associated with that group has access on that port to instances associated with the security group you are creating. This allows us to use a self-referencing set of rules for intercommunication between our instances without prior knowledge of instance IP addresses.
Here are my inbound rules; note the two source security groups are this group itself, for self-reference, and the security group of my EC2-Classic instance.
Protocol      Port   Source
SSH           22     sg-ec2classicSG
HTTP          80     sg-ec2classicSG
Custom TCP    8080   sg-this
Custom TCP    2181   sg-this
Custom TCP    2191   sg-this
Custom TCP    3888   sg-this
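If you prefer to script the rules, the self-referencing entries can be added with the AWS CLI along these lines (sg-xxxxxxxx stands in for the group’s real id, which you only know once it has been created):
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 8080 --source-group sg-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 2181 --source-group sg-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 2191 --source-group sg-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 3888 --source-group sg-xxxxxxxx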


  • Create a load balancer. This will be used to marshal the HTTP requests to Solr from outside the VPC. When we launch instances we will associate them with this load balancer. Create the load balancer inside the SolrCloud VPC and configure the listener to forward HTTP requests on port 80 to HTTP port 8080 on the instances; this is the port our tomcat instances are running on.
The load balancer needs a mechanism to ensure the instances are up and running. It does this with a health check ping. We want to ensure the Solr application is running, so we choose a URL within that application. I simply set the path to the favicon.ico file in the Solr application root with the default timeouts and thresholds.
HTTP 8080 /solr/favicon.ico
Select the Solr subnet we set up previously. We have the option to create multiple subnets in different availability zones within our VPC to give us high availability. This would be a good idea, but for now let’s keep it simple and stick with one subnet.
Select the security group we created in the previous step; we will add our instances later, so go ahead and create the load balancer.
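For reference, roughly the same load balancer could be created from the AWS CLI; the name and ids below are placeholders:
aws elb create-load-balancer --load-balancer-name solrcloud-elb \
    --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=8080" \
    --subnets subnet-xxxxxxxx --security-groups sg-xxxxxxxx
aws elb configure-health-check --load-balancer-name solrcloud-elb \
    --health-check Target=HTTP:8080/solr/favicon.ico,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=10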

  • Create the EC2 ClassicLink between the EC2-Classic instance and the VPC. As previously stated, we don’t expose Solr to the outside world. Access to Solr will be performed from another application which will make HTTP requests to the Solr load balancer within the VPC. My application resides on an EC2-Classic instance and it’s a simple case of selecting the instance and linking it to the VPC.

2.3 Launch The Instances and Create the Cloud 

Now it’s time to create and launch the cluster. We will launch 3 instances from the AMI we made in section 2.1. We will need to amend some of the Zookeeper configuration we intentionally left commented out in section 2.1.4 and the solr.xml configuration which defines the zkHost entries, and also upload and link the Solr configuration into Zookeeper. We need to do this on each instance we launch, and the configuration is subtly different on each; getting this right is important to ensure the cluster works as intended.
  • In the EC2 Dashboard select the AMI we made earlier and hit launch. The type of instance is up to you; I used t2.small which gives us ample memory and power to run Solr and Zookeeper. Set the number of instances to 3 and ensure you select the VPC and subnet we created in the previous section. I gave each a volume of 8 GiB, but again this will depend on the size of the index you intend to run in Solr. Give the instances a name; we can tweak this later and add numbers to each individual instance. Select the security group we created previously and launch the instances with your key. Note, you will need to put this key on the EC2-Classic instance from which you will hop your SSH session.
When each instance is up and running, note the private IP addresses; write them down somewhere, recording which address belongs to which server, using a pencil and piece of paper!
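If pencil and paper don’t appeal, something along these lines pulls the private IPs out of the API instead (this assumes you gave the instances a Name tag starting with SolrCloud):
aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=SolrCloud*" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].[InstanceId,PrivateIpAddress]" --output table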

  • SSH into your EC2-Classic instance, then open a session to the first VPC instance. Check tomcat and zookeeper are running using the following commands; remember, they should autostart on boot.
ps -ef | grep tomcat
ps -ef | grep zookeeper
Shut down tomcat:
/etc/init.d/tomcat7 stop

  • Edit the hosts file in /etc/hosts and change the local IP address to that of the machine’s private IP. You will then need to reload the network interface or reboot the machine. If you reboot, you’ll need to SSH back in from your EC2-Classic instance.
/etc/init.d/network restart or shutdown -r now

  • Configure Solr: open the file /var/lib/solr/solr.xml and add your three private IP addresses to the zkHost entry in the solrcloud configuration stanza.

<str name="zkHost">x.x.x.x:2181,x.x.x.x:2181,x.x.x.x:2181</str>

  • Configure zookeeper. Open up the file in /usr/share/zookeeper-3.4.6/conf/zoo.cfg and uncomment the lines for servers 2 & 3 in the zookeeper ensemble section. Add the IP addresses for servers 1, 2 & 3, paying particular attention to which server is which! My configuration looks like this:
# zookeeper ensemble
server.1=localhost:2191:3888
server.2=x.x.x.x:2191:3888
server.3=x.x.x.x:2191:3888
Remember, which entry is localhost will be different on each machine, so don’t just copy this file from one instance to another!
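This is also the point to update the myid file we created in 2.1.4; the baked-in value of 1 must be changed on the second and third instances to match their server numbers, for example on server 2:
echo "2" > /var/lib/zookeeper/myid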

  •  Use the zkcli.sh script to upload the Solr configuration to Zookeeper by issuing the same commands we did back in 2.1.4 

/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd upconfig -d /var/lib/solr/collection1/conf/ -n collection1_conf
/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd linkconfig -collection collection1 -confname collection1_conf -solrhome /var/lib/solr/
/home/ec2-user/solr-4.10.3/example/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd bootstrap -solrhome /var/lib/solr/

  • Restart tomcat and the instance is now running and ready. Repeat all these steps in section 2.3 on each instance launched into your VPC. Tail the solr and zookeeper logs on each instance as you bring them up and watch the cluster work :)
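Something like the following, run on each node as it comes up, is a handy way to watch them join (log paths as configured earlier):
tail -f /var/log/solr/solr.log /var/log/zookeeper/zookeeper.out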

The cluster will now look something like the image below.



Using the configuration we implemented in section 2.2 means you won’t be able to connect to the Solr admin page via the load balancer. However, feel free to open up the security group to the outside, but remember: this means your Solr implementation is open to the world. If you do, I suggest confining access to your own IP address. You could also just launch your instances into EC2-Classic rather than a VPC and set them behind a public load balancer; again, remember to allow only your own IP address in your security group rather than the entire Internet. Browse to the admin page using the public domain of the load balancer and you’ll see something like this.

SolrCloud running 3 instances


2.4 AWS Monitoring

Now our SolrCloud and Zookeeper ensemble is up and running, we should think about adding some monitoring. We don’t autoscale the instances behind our load balancer simply because we need to configure each instance in the cluster using the private IP addresses of the others. We don’t know these until we launch the instances and, although there may be some good potential for future work here, we can’t set that configuration automatically when an instance comes up. It’s easy enough to set some monitors on our cluster so that, should we lose an instance, we can be notified and manually bring another up without any loss of service.

Without going into great detail about AWS monitoring, all I will say is: configure some alarms on the VPC load balancer to send yourself an email should the number of healthy instances drop below 3. There are many other metrics to which you could attach alarms and it’s up to you how far you want to go.
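As a rough sketch, an alarm on the load balancer’s HealthyHostCount metric could be created like this; the load balancer name and SNS topic ARN are placeholders:
aws cloudwatch put-metric-alarm --alarm-name solrcloud-healthy-hosts \
    --namespace AWS/ELB --metric-name HealthyHostCount \
    --dimensions Name=LoadBalancerName,Value=solrcloud-elb \
    --statistic Minimum --period 60 --evaluation-periods 1 \
    --threshold 3 --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:solrcloud-alerts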

3.0 Future work

I’m sure after going through this you’ll see potential to change and improve some things. Here are a few things I would like to do to improve the performance and resilience of the service.

3.1 Use multiple subnets in different availability zones

We only created one subnet within our VPC. This means all the instances we launch are in the same availability zone. Should anything happen to that data centre, we would lose our Solr service. The idea of SolrCloud is to provide greater resilience and get away from single points of failure. To that end it might be prudent to create a private subnet in each of three availability zones and launch one instance in each. We could then scale the cluster up to two or three instances in each zone, giving improved performance and resilience.

3.2 Auto Setting IPs when launching instances

As I mentioned before, our cluster is fairly static and we cannot auto scale the instances behind the load balancer. Although that’s not a great issue, it would be really nice if the entire system was completely hands-off and scaled up and down automatically. I’m not 100% certain on how to use subnet CIDRs, but it could be possible either to define all the possible IP addresses in our configuration and create one generic image which could then be used in the autoscaling script, or to get the auto-launched instance to receive a predefined private IP specified in the launch configuration.
