For example, to switch the CPU assignment in the domain horatio, so that VCPU0 runs on CPU2 and VCPU1 runs on CPU0:

# xm vcpu-pin horatio 0 2
# xm vcpu-pin horatio 1 0

Equivalently, you can pin VCPUs in the domain config file (/etc/xen/horatio, if you're using our standard naming convention) like this:

vcpus = 2
cpus = [0, 2]
This gives the domain two VCPUs, pins the first VCPU to the first physical CPU, and pins the second VCPU to the third physical CPU.
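To confirm that the pinning took effect, you can ask xm to list each VCPU's placement and affinity:

# xm vcpu-list horatio

The CPU and CPU Affinity columns should reflect the physical CPUs you pinned each VCPU to.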
Credit Scheduler

The Xen team designed the credit scheduler to minimize wasted CPU time. This makes it a work-conserving scheduler, in that it tries to ensure that the CPU will always be working whenever there is work for it to do.
As a consequence, if there is more real CPU available than the domUs are demanding, all domUs get all the CPU they want. When there is contention-that is, when the domUs in aggregate want more CPU than actually exists-then the scheduler arbitrates fairly between the domains that want CPU.
Xen does its best to do a fair division, but the scheduling isn't perfect by any stretch of the imagination. In particular, cycles spent servicing I/O by domain 0 are not charged to the responsible domain, leading to situations where I/O-intensive clients get a disproportionate share of CPU usage. Nonetheless, you can get pretty good allocation in nonpathological cases. (Also, in our experience, the CPU sits idle most of the time anyway.)

The credit scheduler assigns each domain a weight and, optionally, a cap. The weight indicates the relative CPU allocation of a domain-if the CPU is scarce, a domain with a weight of 512 will receive twice as much CPU time as a domain with a weight of 256 (the default). The cap sets an absolute limit on the amount of CPU time a domain can use, expressed in hundredths of a CPU. Note that the CPU cap can exceed 100 on multiprocessor hosts.
The scheduler transforms the weight into a credit allocation for each VCPU, using a separate accounting thread. As a VCPU runs, it consumes credits. If a VCPU runs out of credits, it only runs when other, more thrifty VCPUs have finished executing, as shown in Figure 7-1. Periodically, the accounting thread goes through and gives everybody more credits.
Figure 7-1. VCPUs wait in two queues: one for VCPUs with credits and the other for those that are over their allotment. Once the first queue is exhausted, the CPU will pull from the second.
In this case, the details are probably less important than the practical application. Using the xm sched-credit commands, we can adjust CPU allocation on a per-domain basis. For example, here we'll increase a domain's CPU allocation. First, to list the weight and cap for the domain horatio:

# xm sched-credit -d horatio
{'cap': 0, 'weight': 256}

Then, to modify the scheduler's parameters:

# xm sched-credit -d horatio -w 512
# xm sched-credit -d horatio
{'cap': 0, 'weight': 512}

Of course, the value "512" only has meaning relative to the other domains that are running on the machine. Make sure to set all the domains' weights appropriately.
To set the cap for a domain:

# xm sched-credit -d domain -c cap

Scheduling for Providers

We decided to divide the CPU along the same lines as the available RAM-it stands to reason that a user paying for half the RAM in a box will want more CPU than someone with a 64MB domain. Thus, in our setup, a customer with 25 percent of the RAM also has a minimum share of 25 percent of the CPU cycles.
The simple way to do this is to assign each domU a weight equal to the number of megabytes of memory it has and leave the cap empty. The scheduler will then handle converting that into fair proportions. For example, our aforementioned user with half the RAM will get about as much CPU time as the rest of the users put together.
Of course, that's the worst case; that is what the user will get in an environment of constant struggle for the CPU. Idle domains will automatically yield the CPU. If all domains but one are idle, that one can have the entire CPU to itself.
Note: It's essential to make sure that the dom0 has sufficient CPU to service I/O requests. You can handle this by dedicating a CPU to the dom0 or by giving the dom0 a very high weight-high enough to ensure that it never runs out of credits. At prgmr.com, we handle the problem by weighting each domU with its RAM amount and weighting the dom0 at 6000.
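As a rough illustration of the weight = memory approach (including the dom0 weighting mentioned in the note), a loop along these lines could set the weights from xm list output. The script name and the column parsing are our own and may need adjusting for your xm version:

#!/bin/sh
# set-weights.sh (hypothetical): weight each domU by its memory in MB,
# and give the dom0 a large fixed weight so it never starves.
xm list | tail -n +2 | while read name id mem rest; do
    if [ "$name" = "Domain-0" ]; then
        xm sched-credit -d "$name" -w 6000
    else
        xm sched-credit -d "$name" -w "$mem"
    fi
done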
This simple weight = memory formula becomes a bit more complex when dealing with multiprocessor systems because independent systems of CPU allocation come into play. A good rule would be to allocate VCPUs in proportion to memory (and therefore in proportion to weight). For example, a domain with half the RAM on a box with four cores (and hyperthreading turned off) should have at least two VCPUs. Another solution would be to give all domains as many VCPUs as physical processors in the box-this would allow all domains to burst to the full CPU capacity of the physical machine but might lead to increased overhead from context swaps.
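In config-file terms, the first rule is just the vcpus line; for the half-the-RAM domain on a four-core box you might write (numbers illustrative):

vcpus = 2

while the burst-to-everything approach would give every domain on that box:

vcpus = 4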
Controlling Network Resources

Network resource controls are, frankly, essential to any kind of shared hosting operation. One of the many lessons that we've learned from Xen hosting is that if you provide free bandwidth, some users will exploit it for all it's worth. This isn't a Xen-specific observation, but it's especially noticeable with the sort of cheap VPS hosting Xen lends itself to.
We prefer to use network-bridge, since that's the default. For a more thorough look at network-bridge, take a look at Chapter 5.
Monitoring Network Usage

Given that some users will consume as much bandwidth as possible, it's vital to have some way to monitor network traffic.[40]
To monitor network usage, we use BandwidthD on a physical SPAN port. It's a simple tool that counts bytes going through a switch-nothing Xen-specific here. We feel comfortable doing this because our provider doesn't allow anything but IP packets in or out, and our antispoof rules are good enough to protect us from users spoofing their IP on outgoing packets.
A similar approach would be to extend the "dom0 is a switch" analogy and use SNMP monitoring software. As mentioned in Chapter 5, it's important to specify a vifname for each domain if you're doing this. In any case, we'll leave the particulars of bandwidth monitoring up to you.
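As a reminder of what that looks like, a vif line with an explicit vifname might read something like the following (the bridge name and vifname here are examples only):

vif = [ 'vifname=baldr, bridge=xenbr0' ]

With a stable name like baldr, the per-interface counters your SNMP software collects (and the iptables rules later in this chapter) always refer to the same domain.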
ARP CACHE POISONING

If you use the default network-bridge setup, you are vulnerable to ARP cache poisoning, just as on any layer 2 switch.

The idea is that the interface counters on a layer 2 switch-such as the virtual switch used by network-bridge-watch traffic as it passes through a particular port. Every time a switch sees an Ethernet frame or ARP is-at, it keeps track of what port and MAC it came from. If it gets a frame destined for a MAC address in its cache, it sends that frame down the proper port (and only the proper port). If the bridge sees a frame destined for a MAC that is not in the cache, it sends that frame to all ports.[41]

Clever, no? In most cases this means that you almost never see Ethernet frames destined for other MAC addresses (other than broadcasts, etc.). However, this feature is designed purely as an optimization, not a security measure. As those of you with cable providers who do MAC address verification know quite well, it is fairly trivial to fake a MAC address. This means that a malicious user can fill the (limited in size) ARP cache with bogus MAC addresses, drive out the good data, and force all packets to go down all interfaces. At this point the switch becomes basically a hub, and the counters on all ports will show all traffic for any port.

There are two ways we have worked around the problem. You could use Xen's network-route networking model, which doesn't use a virtual bridge. The other approach is to ignore the interface counters and use something like BandwidthD, which bases its accounting on IP packets.
Once you can examine traffic quickly, the next step is to shape the users. The principles for network traffic shaping and policing are the same as for standalone boxes, except that you can also implement policies on the Xen host. Let's look at how to limit both incoming and outgoing traffic for a particular interface-as if, say, you have a customer who's going over his bandwidth allotment.
Network Shaping Principles

The first thing to know about shaping is that it only works on outgoing traffic. Although it is possible to police incoming traffic, it isn't as effective. Fortunately, both directions look like outgoing traffic at some point in their passage through the dom0, as shown in Figure 7-2. (When we refer to outgoing and incoming traffic in the following description, we mean from the perspective of the domU.)

Figure 7-2. Incoming traffic comes from the Internet, goes through the virtual bridge, and gets shaped by a simple nonhierarchical filter. Outgoing traffic, on the other hand, needs to go through a system of filters that assign packets to classes in a hierarchical queuing discipline.
Shaping Incoming Traffic

We'll start with incoming traffic because it's much simpler to limit than outgoing traffic. The easiest way to shape incoming traffic is probably the token bucket filter queuing discipline, which is a simple, effective, and lightweight way to slow down an interface.
The token bucket filter, or TBF, takes its name from the metaphor of a bucket of tokens. Tokens stream into the bucket at a defined and constant rate. Each byte of data sent takes one token from the bucket and goes out immediately-when the bucket's empty, data can only go as tokens come in. The bucket itself has a limited capacity, which guarantees that only a reasonable amount of data will be sent out at once. To use the TBF, we add a qdisc (queuing discipline) to perform the actual work of traffic limiting. To limit the virtual interface osric to 1 megabit per second, with bursts up to 2 megabits and maximum allowable latency of 50 milliseconds:

# tc qdisc add dev osric root tbf rate 1mbit latency 50ms peakrate 2mbit maxburst 40MB

This adds a qdisc to the device osric. The next arguments specify where to add it (root) and what sort of qdisc it is (tbf). Finally, we specify the rate, latency, burst rate, and amount that can go at burst rate. These parameters correspond to the token flow, amount of latency the packets are allowed to have (before the driver signals the operating system that its buffers are full), maximum rate at which the bucket can empty, and the size of the bucket.
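To check that the qdisc is attached and see how much traffic it has passed or held back, you can ask tc for statistics:

# tc -s qdisc show dev osric

The counters in the output show bytes sent and packets dropped or over limit, which is a quick way to verify that the limit is actually biting.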
Shaping Outgoing Traffic

Having shaped incoming traffic, we can focus on limiting outgoing traffic. This is a bit more complex because the outgoing traffic for all domains goes through a single interface, so a single token bucket won't work. The policing filters might work, but they handle the problem by dropping packets, which is ... bad. Instead, we're going to apply traffic shaping to the outgoing physical Ethernet device, peth0, with a Hierarchical Token Bucket, or HTB, qdisc.
The HTB discipline acts like the simple token bucket, but with a hierarchy of buckets, each with its own rate, and a system of filters to assign packets to buckets. Here's how to set it up.
First, we have to make sure that the packets on Xen's virtual bridge traverse iptables:

# echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

This is so that we can mark packets according to which domU emitted them. There are other reasons, but that's the important one in terms of our traffic-shaping setup. Next, for each domU, we add a rule to mark packets from the corresponding network interface:

# iptables -t mangle -A FORWARD -m physdev --physdev-in baldr -j MARK --set-mark 5

Here the number 5 is an arbitrary mark-it's not important what the number is, as long as there's a useful mapping between number and domain. We're using the domain ID. We could also use tc filters directly that match on source IP address, but it feels more elegant to have everything keyed to the domain's physical network device. Note that we're using physdev-in: traffic that goes out from the domU comes in to the dom0, as Figure 7-3 shows.
Figure 7-3. We shape traffic coming into the domU as it comes into the dom0 from the physical device, and shape traffic leaving the domU as it enters the dom0 on the virtual device.
Next we create an HTB qdisc. We won't go over the HTB options in too much detail-see the documentation at ~devik/qos/htb/manual/userg.htm for more details:

# tc qdisc add dev peth0 root handle 1: htb default 12

Then we make some classes to put traffic into. Each class will get traffic from one domU. (As the HTB docs explain, we're also making a parent class so that they can share surplus bandwidth.)

# tc class add dev peth0 parent 1: classid 1:1 htb rate 100mbit
# tc class add dev peth0 parent 1:1 classid 1:2 htb rate 1mbit

Now that we have a class for our domU's traffic, we need a filter that will assign packets to it.
# tc filter add dev peth0 protocol ip parent 1:0 prio 1 handle 5 fw flowid 1:2

Note that we're matching on the "handle" that we set earlier using iptables. This assigns the packet to the 1:2 class, which we've previously limited to 1 megabit per second.
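You can verify that marked packets are really landing in the 1:2 class (and, incidentally, get per-class byte counters) with:

# tc -s class show dev peth0

If the Sent counter on class 1:2 stays at zero while the domU is transmitting, the iptables mark or the fw filter isn't matching.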
At this point traffic to and from the target domU is essentially shaped, as demonstrated by Figure 7-4. You can easily add commands like these to the end of your vif script, be it vif-bridge, vif-route, or a wrapper. We would also like to emphasize that this is only an example and that the Linux Advanced Routing and Traffic Control HOWTO is an excellent place to look for further documentation. The tc man page is also informative.
Figure 7-4. The effect of the shaping filters
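If you'd rather not retype the mark, class, and filter commands for every domU, a small helper along the following lines can be called from your vif wrapper or by hand. The script, its name, and the convention of using one number for both the mark and the class ID are our own, not part of Xen:

#!/bin/sh
# shape-domu.sh (hypothetical): apply the mark/class/filter trio for one domU.
# Usage: shape-domu.sh <vifname> <number> <rate>, e.g. shape-domu.sh baldr 5 1mbit
vif=$1; num=$2; rate=$3
iptables -t mangle -A FORWARD -m physdev --physdev-in "$vif" -j MARK --set-mark "$num"
tc class add dev peth0 parent 1:1 classid "1:$num" htb rate "$rate"
tc filter add dev peth0 protocol ip parent 1:0 prio 1 handle "$num" fw flowid "1:$num"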
[39] In HVM mode, the emulated QEMU devices are something of a risk, which is part of why we don't offer HVM domains.
[40] In this case, we're talking about bandwidth monitoring. You should also run some sort of IDS, such as Snort, to watch for outgoing abuse (we do) but there's nothing Xen-specific about that.
[41] We are using the words port and interface interchangeably here. This is a reasonable simplification in the context of interface counters on an SNMP-capable switch.
Storage in a Shared Hosting Environment

As with so much else in system administration, a bit of planning can save a lot of trouble. Figure out beforehand where you're going to store pristine filesystem images, where configuration files go, and where customer data will live.
For pristine images, there are a lot of conventions-some people use /diskimages, some use /opt/xen, /var/xen, or similar, some use a subdirectory of /home. Pick one and stick with it.
Configuration files should, without exception, go in /etc/xen. If you don't give xm create a full path, it'll look for the file in /etc/xen. Don't disappoint it.
As for customer data, we recommend that serious hosting providers use LVM. This allows greater flexibility and manageability than blktap-mapped files while maintaining good performance. Chapter 4 covers the details of working with LVM (or at least enough to get started), as well as many other available storage options and their advantages. Here we're confining ourselves to lessons that we've learned from our adventures in shared hosting.
Regulating Disk Access with ionice

One common problem with VPS hosting is that customers-or your own housekeeping processes, like backups-will use enough I/O bandwidth to slow down everyone on the machine. Furthermore, I/O isn't really affected by the scheduler tweaks discussed earlier. A domain can request data, hand off the CPU, and save its credits until it's notified of the data's arrival.
Although you can't set hard limits on disk access rates as you can with the network QoS, you can use the ionice command to prioritize the different domains into subclasses, with a syntax like:

# ionice -p <PID> -c <class> -n <priority>

Here -n is the knob you'll ordinarily want to twiddle. It can range from 0 to 7, with lower numbers taking precedence.
We recommend always specifying 2 for the class. Other classes exist-3 is idle and 1 is realtime-but idle is extremely conservative, while realtime is so aggressive as to have a good chance of locking up the system. The within-class priority is aimed at proportional allocation, and is thus much more likely to be what you want.
Let's look at ionice in action. Here we'll test ionice with two different domains, one with the highest normal priority, the other with the lowest.
First, ionice only works with the CFQ I/O scheduler. To check that you're using the CFQ scheduler, run this command in the dom0:

# cat /sys/block/[sh]d[a-z]*/queue/scheduler
noop anticipatory deadline [cfq]
noop anticipatory deadline [cfq]
The word in brackets is the selected scheduler. If it's not [cfq], reboot with the parameter elevator=cfq.
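On most kernels you can also switch a single disk's scheduler on the fly rather than rebooting; for example (using sda as a stand-in for your actual device):

# echo cfq > /sys/block/sda/queue/scheduler

Rereading the scheduler file should then show [cfq] in brackets.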
Next we find the processes we want to ionice. Because we're using tap:aio devices in this example, the dom0 process is tapdisk. If we were using phy: devices, it'd be [xvd ].

# ps aux | grep tapdisk
root  1054  0.5  0.0  13588  556 ?  Sl  05:45  0:10  tapdisk /dev/xen/tapctrlwrite1 /dev/xen/tapctrlread1
root  1172  0.6  0.0  13592  560 ?  Sl  05:45  0:10  tapdisk /dev/xen/tapctrlwrite2 /dev/xen/tapctrlread2

Now we can ionice our domains. Note that the numbers of the tapctrl devices correspond to the order the domains were started in, not the domain ID.
# ionice -p 1054 -c 2 -n 7
# ionice -p 1172 -c 2 -n 0

To test ionice, let's run a couple of Bonnie++ processes and time them. (After Bonnie++ finishes, we dd a load file, just to make sure that conditions for the other domain remain unchanged.)

prio7domU tmp # /usr/bin/time -v bonnie++ -u 1 && dd if=/dev/urandom of=load
prio0domU tmp # /usr/bin/time -v bonnie++ -u 1 && dd if=/dev/urandom of=load

In the end, according to the wall clock, the domU with priority 0 took 3:32.33 to finish, while the priority 7 domU needed 5:07.98. As you can see, the ionice priorities provide an effective way to do proportional I/O allocation.
The best way to apply ionice is probably to look at CPU allocations and convert them into priority classes. Domains with the highest CPU allocation get priority 1, next highest priority 2, and so on. Processes in the dom0 should be ioniced as appropriate. This will ensure a reasonable priority, but not allow big domUs to take over the entirety of the I/O bandwidth.
Backing Up DomUs

As a service provider, one rapidly learns that customers don't do their own backups. When a disk fails (not if-when), customers will expect you to have complete backups of their data, and they'll be very sad if you don't. So let's talk about backups.
Of course, you already have a good idea how to back up physical machines. There are two aspects to backing up Xen domains: First, there's the domain's virtual disk, which we want to back up just as we would a real machine's disk. Second, there's the domain's running state, which can be saved and restored from the dom0. Ordinarily, our use of backup refers purely to the disk, as it would with physical machines, but with the advantage that we can use domain snapshots to pause the domain long enough to get a clean disk image.
We use xm save and LVM snapshots to back up both the domain's storage and running state. LVM snapshots aren't a good way of implementing full copy-on-write because they handle the "out of snapshot space" case poorly, but they're excellent if you want to preserve a filesystem state long enough to make a consistent backup.
Our implementation copies the entire disk image using either a plain cp (in the case of file-backed domUs) or dd (for phy: devices). This is because we very much want to avoid mounting a possibly unclean filesystem in the dom0, which can cause the entire machine to panic. Besides, if we do a raw device backup, domU administrators will be able to use filesystems (such as ZFS on an OpenSolaris domU) that the dom0 cannot read.
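For a single LVM-backed domU, the whole procedure boils down to something like the following sketch; the domain name, the volume group (verona), the snapshot size, and the paths are examples rather than anything prescribed by Xen:

# mkdir -p /var/backup/xen/horatio
# xm save horatio /var/backup/xen/horatio/horatio.save
# lvcreate -s -L 2G -n horatio_snap /dev/verona/horatio
# xm restore /var/backup/xen/horatio/horatio.save
# dd if=/dev/verona/horatio_snap of=/var/backup/xen/horatio/horatio.img bs=1M
# lvremove -f /dev/verona/horatio_snap

The domain is only down for as long as the save and the snapshot take; the slow dd runs against the snapshot while the domU is already back in service.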
An appropriate script to do as we've described might be:

#!/usr/bin/perl
my (@disks, @stores, @files, @lvs);
$domain = $ARGV[0];

my $destdir = "/var/backup/xen/${domain}/";
system "mkdir -p $destdir";

open(FILE, "/etc/xen/$domain");
while (<FILE>) {
    if (m/^disk/) {
        s/.*\[\s+([^\]]+)\s*\].*/\1/;
        @disks = split(/[,]/);
        # discard elements without a :, since they can't be
        # backing store specifiers
        while ($disks[$n]) {
            $disks[$n] =~ s/['"]//g;
            push(@stores, "$disks[$n]") if ("$disks[$n]" =~ m/:/);
            $n++;
        }
        $n = 0;

        # split on : and take only the last field if the first
        # is a recognized device specifier.
        while ($stores[$n]) {
            @tmp = split(/:/, $stores[$n]);
            if (($tmp[0] =~ m/file/i) || ($tmp[0] =~ m/tap/i)) {
                push(@files, $tmp[$#tmp]);
            }
            elsif ($tmp[0] =~ m/phy/i) {
                push(@lvs, $tmp[$#tmp]);
            }
            $n++;
        }
    }
}
close FILE;