
Call for best practice: Talking to r/o slaves through a load-balancer

I am looking for people who have a bunch of r/o slaves running, and who are using a load balancer to distribute queries across them.

The typical setup would be a PHP or Perl style deployment with transient connections: each connection ends when page generation finishes, and a new connect is made for the next request served. That connect goes to the load balancer, which forwards it to any suitable database in the pool.
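A minimal sketch of that per-request pattern, written in Python with PyMySQL for illustration; the hostname, credentials, and schema are placeholders, and the LB endpoint name is an assumption, not part of the original setup:

```python
import pymysql

def handle_request(user_id):
    # Connect to the load balancer's virtual IP, not to an individual slave;
    # which r/o slave actually serves the query is entirely the LB's decision.
    conn = pymysql.connect(
        host="ro-pool.db.example.com",  # LB VIP / DNS name (placeholder)
        user="webapp",
        password="secret",
        database="shop",
        connect_timeout=2,              # fail fast if the pool is unhealthy
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT name, email FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        # The connection lives only for the duration of this request,
        # mirroring the PHP/Perl page-generation lifecycle described above.
        conn.close()
```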

I am looking for people who are actually running such a deployment, and for the strategies they use to cope with potential problems. I would also like to better understand what common problems they have had to address.

Things I can imagine off the top of my head:

- Slave lag. Slave lag can happen on a single box due to an individual failure (the battery on the RAID controller expires) or on many boxes at once (an ALTER TABLE logjams the whole replication hierarchy). In the latter case boxes cannot simply be dropped from the load balancer, lest you end up with an empty pool (see the lag-check sketch after this list).

- Identifying problematic machines and isolating faults. At the moment, machines sending problematic requests are easy to identify: we can SHOW PROCESSLIST, see the problem query and the host and port it is coming from, then run lsof on the offending source machine to see which process owns the connection. With an LB in between we lose this ability, unless we do fearful layer 2 magic at the LB. How do you identify sources of disruption elegantly and track them down to take them out? (See the query-tagging sketch after this list.)

- What is a good pool size? We can unify individual cells, up to an entire data center's capacity, into one single supercell, but we think that may be too big a setup. What sizing guidelines should be used here?
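On the slave-lag point, one sketch of a lag-aware health check: drop a slave from the pool once Seconds_Behind_Master exceeds a threshold, but never shrink below a minimum pool size, so a hierarchy-wide ALTER TABLE cannot empty the pool. The host list, threshold, and the enable/disable callbacks into the load balancer's admin interface are all assumptions here, not any particular LB's API:

```python
import pymysql

SLAVES = ["db-ro-01", "db-ro-02", "db-ro-03", "db-ro-04"]  # placeholder hosts
MAX_LAG_SECONDS = 30
MIN_POOL_SIZE = 2

def slave_lag(host):
    """Return Seconds_Behind_Master for one slave, or None if replication is broken."""
    conn = pymysql.connect(host=host, user="monitor", password="secret")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

def rebalance(enable_backend, disable_backend):
    """enable_backend/disable_backend are hypothetical callbacks that talk to the LB."""
    lags = {host: slave_lag(host) for host in SLAVES}
    healthy = [h for h, lag in lags.items() if lag is not None and lag <= MAX_LAG_SECONDS]

    if len(healthy) >= MIN_POOL_SIZE:
        # Normal case: only non-lagging slaves serve traffic.
        keep = set(healthy)
    else:
        # Degenerate case (e.g. an ALTER TABLE lagging every box): keep the
        # least-lagged slaves in rotation rather than emptying the pool.
        ranked = sorted(SLAVES, key=lambda h: lags[h] if lags[h] is not None else float("inf"))
        keep = set(ranked[:MIN_POOL_SIZE])

    for host in SLAVES:
        (enable_backend if host in keep else disable_backend)(host)
```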

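On the fault-isolation point, one way to keep the "who is sending this query?" visibility without layer 2 magic is to tag every statement with a comment naming the real client host and process; SHOW FULL PROCESSLIST on the slave then shows the origin in the Info column even though the TCP connection terminates at the load balancer. The wrapper below is an illustrative sketch, not an established library:

```python
import os
import socket

def tag_query(sql):
    """Prefix a statement with an identifying comment that the server preserves."""
    origin = f"/* src={socket.gethostname()} pid={os.getpid()} */ "
    return origin + sql

# Usage:
#   cur.execute(tag_query("SELECT id FROM orders WHERE status = 'open'"))
# SHOW FULL PROCESSLIST on the slave then displays the comment alongside the query,
# so the offending web box and process can be found without tracing the LB connection.
```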
What else am I missing here?
