For example, to switch the CPU assignment in the domain horatio, so that VCPU0 runs on CPU2 and VCPU1 runs on CPU0:

# xm vcpu-pin horatio 0 2
# xm vcpu-pin horatio 1 0

Equivalently, you can pin VCPUs in the domain config file (/etc/xen/horatio, if you're using our standard naming convention) like this:

vcpus = 2
cpus = [0, 2]
This gives the domain two VCPUs, pins the first VCPU to the first physical CPU, and pins the second VCPU to the third physical CPU.
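To confirm that the pinning took effect, you can ask xm to list each VCPU's placement and affinity:

# xm vcpu-list horatio

The CPU and CPU Affinity columns should reflect the physical CPUs you pinned each VCPU to.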
Credit Scheduler

The Xen team designed the credit scheduler to minimize wasted CPU time. This makes it a work-conserving scheduler, in that it tries to ensure that the CPU will always be working whenever there is work for it to do.
As a consequence, if there is more real CPU available than the domUs are demanding, all domUs get all the CPU they want. When there is contention-that is, when the domUs in aggregate want more CPU than actually exists-then the scheduler arbitrates fairly between the domains that want CPU.
Xen does its best to do a fair division, but the scheduling isn't perfect by any stretch of the imagination. In particular, cycles spent servicing I/O by domain 0 are not charged to the responsible domain, leading to situations where I/O-intensive clients get a disproportionate share of CPU usage. Nonetheless, you can get pretty good allocation in nonpathological cases. (Also, in our experience, the CPU sits idle most of the time anyway.)

The credit scheduler assigns each domain a weight and, optionally, a cap. The weight indicates the relative CPU allocation of a domain-if the CPU is scarce, a domain with a weight of 512 will receive twice as much CPU time as a domain with a weight of 256 (the default). The cap sets an absolute limit on the amount of CPU time a domain can use, expressed in hundredths of a CPU. Note that the CPU cap can exceed 100 on multiprocessor hosts.
The scheduler transforms the weight into a credit allocation for each VCPU, using a separate accounting thread. As a VCPU runs, it consumes credits. If a VCPU runs out of credits, it only runs when other, more thrifty VCPUs have finished executing, as shown in Figure 7-1. Periodically, the accounting thread goes through and gives everybody more credits.
Figure 7-1. VCPUs wait in two queues: one for VCPUs with credits and the other for those that are over their allotment. Once the first queue is exhausted, the CPU will pull from the second.
In this case, the details are probably less important than the practical application. Using the xm sched-credit commands, we can adjust CPU allocation on a per-domain basis. For example, here we'll increase a domain's CPU allocation. First, to list the weight and cap for the domain horatio:

# xm sched-credit -d horatio
{'cap': 0, 'weight': 256}

Then, to modify the scheduler's parameters:

# xm sched-credit -d horatio -w 512
# xm sched-credit -d horatio
{'cap': 0, 'weight': 512}

Of course, the value "512" only has meaning relative to the other domains that are running on the machine. Make sure to set all the domains' weights appropriately.
To set the cap for a domain:

# xm sched-credit -d domain -c cap

Scheduling for Providers

We decided to divide the CPU along the same lines as the available RAM-it stands to reason that a user paying for half the RAM in a box will want more CPU than someone with a 64MB domain. Thus, in our setup, a customer with 25 percent of the RAM also has a minimum share of 25 percent of the CPU cycles.
The simple way to do this is to assign each domU a weight equal to the number of megabytes of memory it has and leave the cap empty. The scheduler will then handle converting that into fair proportions. For example, our aforementioned user with half the RAM will get about as much CPU time as the rest of the users put together.
Of course, that's the worst case; that is what the user will get in an environment of constant struggle for the CPU. Idle domains will automatically yield the CPU. If all domains but one are idle, that one can have the entire CPU to itself.
Note: It's essential to make sure that the dom0 has sufficient CPU to service I/O requests. You can handle this by dedicating a CPU to the dom0 or by giving the dom0 a very high weight-high enough to ensure that it never runs out of credits. At prgmr.com, we handle the problem by weighting each domU with its RAM amount and weighting the dom0 at 6000.
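As a rough illustration of the weight = memory approach (including the dom0 weighting mentioned in the note), a loop along these lines could set the weights from xm list output. The script name and the column parsing are our own and may need adjusting for your xm version:

#!/bin/sh
# set-weights.sh (hypothetical): weight each domU by its memory in MB,
# and give the dom0 a large fixed weight so it never starves.
xm list | tail -n +2 | while read name id mem rest; do
    if [ "$name" = "Domain-0" ]; then
        xm sched-credit -d "$name" -w 6000
    else
        xm sched-credit -d "$name" -w "$mem"
    fi
done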
This simple weight = memory formula becomes a bit more complex when dealing with multiprocessor systems because independent systems of CPU allocation come into play. A good rule would be to allocate VCPUs in proportion to memory (and therefore in proportion to weight). For example, a domain with half the RAM on a box with four cores (and hyperthreading turned off) should have at least two VCPUs. Another solution would be to give all domains as many VCPUs as physical processors in the box-this would allow all domains to burst to the full CPU capacity of the physical machine but might lead to increased overhead from context swaps.
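In config-file terms, the first rule is just the vcpus line; for the half-the-RAM domain on a four-core box you might write (numbers illustrative):

vcpus = 2

while the burst-to-everything approach would give every domain on that box:

vcpus = 4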
Controlling Network Resources

Network resource controls are, frankly, essential to any kind of shared hosting operation. One of the many lessons that we've learned from Xen hosting is that if you provide free bandwidth, some users will exploit it for all it's worth. This isn't a Xen-specific observation, but it's especially noticeable with the sort of cheap VPS hosting Xen lends itself to.
We prefer to use network-bridge, since that's the default. For a more thorough look at network-bridge, take a look at Chapter 5.
Monitoring Network Usage

Given that some users will consume as much bandwidth as possible, it's vital to have some way to monitor network traffic.[40]
To monitor network usage, we use BandwidthD on a physical SPAN port. It's a simple tool that counts bytes going through a switch-nothing Xen-specific here. We feel comfortable doing this because our provider doesn't allow anything but IP packets in or out, and our antispoof rules are good enough to protect us from users spoofing their IP on outgoing packets.
A similar approach would be to extend the "dom0 is a switch" analogy and use SNMP monitoring software. As mentioned in Chapter 5, it's important to specify a vifname for each domain if you're doing this. In any case, we'll leave the particulars of bandwidth monitoring up to you.
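As a reminder of what that looks like, a vif line with an explicit vifname might read something like the following (the bridge name and vifname here are examples only):

vif = [ 'vifname=baldr, bridge=xenbr0' ]

With a stable name like baldr, the per-interface counters your SNMP software collects (and the iptables rules later in this chapter) always refer to the same domain.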
ARP CACHE POISONING

If you use the default network-bridge setup, you are vulnerable to ARP cache poisoning, just as on any layer 2 switch.

The idea is that the interface counters on a layer 2 switch-such as the virtual switch used by network-bridge-watch traffic as it passes through a particular port. Every time a switch sees an Ethernet frame or ARP is-at, it keeps track of what port and MAC it came from. If it gets a frame destined for a MAC address in its cache, it sends that frame down the proper port (and only the proper port). If the bridge sees a frame destined for a MAC that is not in the cache, it sends that frame to all ports.[41]

Clever, no? In most cases this means that you almost never see Ethernet frames destined for other MAC addresses (other than broadcasts, etc.). However, this feature is designed purely as an optimization, not a security measure. As those of you with cable providers who do MAC address verification know quite well, it is fairly trivial to fake a MAC address. This means that a malicious user can fill the (limited in size) ARP cache with bogus MAC addresses, drive out the good data, and force all packets to go down all interfaces. At this point the switch becomes basically a hub, and the counters on all ports will show all traffic for any port.

There are two ways we have worked around the problem. You could use Xen's network-route networking model, which doesn't use a virtual bridge. The other approach is to ignore the interface counters and use something like BandwidthD, which bases its accounting on IP packets.
Once you can examine traffic quickly, the next step is to shape the users. The principles for network traffic shaping and policing are the same as for standalone boxes, except that you can also implement policies on the Xen host. Let's look at how to limit both incoming and outgoing traffic for a particular interface-as if, say, you have a customer who's going over his bandwidth allotment.
Network Shaping Principles

The first thing to know about shaping is that it only works on outgoing traffic. Although it is possible to police incoming traffic, it isn't as effective. Fortunately, both directions look like outgoing traffic at some point in their passage through the dom0, as shown in Figure 7-2. (When we refer to outgoing and incoming traffic in the following description, we mean from the perspective of the domU.)

Figure 7-2. Incoming traffic comes from the Internet, goes through the virtual bridge, and gets shaped by a simple nonhierarchical filter. Outgoing traffic, on the other hand, needs to go through a system of filters that assign packets to classes in a hierarchical queuing discipline.
Shaping Incoming Traffic

We'll start with incoming traffic because it's much simpler to limit than outgoing traffic. The easiest way to shape incoming traffic is probably the token bucket filter queuing discipline, which is a simple, effective, and lightweight way to slow down an interface.
The token bucket filter, or TBF, takes its name from the metaphor of a bucket of tokens. Tokens stream into the bucket at a defined and constant rate. Each byte of data sent takes one token from the bucket and goes out immediately-when the bucket's empty, data can only go as tokens come in. The bucket itself has a limited capacity, which guarantees that only a reasonable amount of data will be sent out at once. To use the TBF, we add a qdisc (queuing discipline) to perform the actual work of traffic limiting. To limit the virtual interface osric to 1 megabit per second, with bursts up to 2 megabits and maximum allowable latency of 50 milliseconds:

# tc qdisc add dev osric root tbf rate 1mbit latency 50ms peakrate 2mbit maxburst 40MB

This adds a qdisc to the device osric. The next arguments specify where to add it (root) and what sort of qdisc it is (tbf). Finally, we specify the rate, latency, burst rate, and amount that can go at burst rate. These parameters correspond to the token flow, amount of latency the packets are allowed to have (before the driver signals the operating system that its buffers are full), maximum rate at which the bucket can empty, and the size of the bucket.
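To check that the qdisc is attached and see how much traffic it has passed or held back, you can ask tc for statistics:

# tc -s qdisc show dev osric

The counters in the output show bytes sent and packets dropped or over limit, which is a quick way to verify that the limit is actually biting.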
Shaping Outgoing Traffic

Having shaped incoming traffic, we can focus on limiting outgoing traffic. This is a bit more complex because the outgoing traffic for all domains goes through a single interface, so a single token bucket won't work. The policing filters might work, but they handle the problem by dropping packets, which is ... bad. Instead, we're going to apply traffic shaping to the outgoing physical Ethernet device, peth0, with a Hierarchical Token Bucket, or HTB, qdisc.
The HTB discipline acts like the simple token bucket, but with a hierarchy of buckets, each with its own rate, and a system of filters to assign packets to buckets. Here's how to set it up.
First, we have to make sure that the packets on Xen's virtual bridge traverse iptables:

# echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

This is so that we can mark packets according to which domU emitted them. There are other reasons, but that's the important one in terms of our traffic-shaping setup. Next, for each domU, we add a rule to mark packets from the corresponding network interface:

# iptables -t mangle -A FORWARD -m physdev --physdev-in baldr -j MARK --set-mark 5

Here the number 5 is an arbitrary mark-it's not important what the number is, as long as there's a useful mapping between number and domain. We're using the domain ID. We could also use tc filters directly that match on source IP address, but it feels more elegant to have everything keyed to the domain's physical network device. Note that we're using physdev-in: traffic that goes out from the domU comes in to the dom0, as Figure 7-3 shows.
Figure 7-3. We shape traffic coming into the domU as it comes into the dom0 from the physical device, and shape traffic leaving the domU as it enters the dom0 on the virtual device.
Next we create an HTB qdisc. We won't go over the HTB options in too much detail-see the documentation at ~devik/qos/htb/manual/userg.htm for more details:

# tc qdisc add dev peth0 root handle 1: htb default 12

Then we make some classes to put traffic into. Each class will get traffic from one domU. (As the HTB docs explain, we're also making a parent class so that they can share surplus bandwidth.)

# tc class add dev peth0 parent 1: classid 1:1 htb rate 100mbit
# tc class add dev peth0 parent 1:1 classid 1:2 htb rate 1mbit

Now that we have a class for our domU's traffic, we need a filter that will assign packets to it.
# tc filter add dev peth0 protocol ip parent 1:0 prio 1 handle 5 fw flowid 1:2

Note that we're matching on the "handle" that we set earlier using iptables. This assigns the packet to the 1:2 class, which we've previously limited to 1 megabit per second.
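You can verify that marked packets are really landing in the 1:2 class (and, incidentally, get per-class byte counters) with:

# tc -s class show dev peth0

If the Sent counter on class 1:2 stays at zero while the domU is transmitting, the iptables mark or the fw filter isn't matching.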
At this point traffic to and from the target domU is essentially shaped, as demonstrated by Figure 7-4. You can easily add commands like these to the end of your vif script, be it vif-bridge, vif-route, or a wrapper. We would also like to emphasize that this is only an example and that the Linux Advanced Routing and Traffic Control HOWTO is an excellent place to look for further documentation. The tc man page is also informative.
Figure 7-4. The effect of the shaping filters
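If you'd rather not retype the mark, class, and filter commands for every domU, a small helper along the following lines can be called from your vif wrapper or by hand. The script, its name, and the convention of using one number for both the mark and the class ID are our own, not part of Xen:

#!/bin/sh
# shape-domu.sh (hypothetical): apply the mark/class/filter trio for one domU.
# Usage: shape-domu.sh <vifname> <number> <rate>, e.g. shape-domu.sh baldr 5 1mbit
vif=$1; num=$2; rate=$3
iptables -t mangle -A FORWARD -m physdev --physdev-in "$vif" -j MARK --set-mark "$num"
tc class add dev peth0 parent 1:1 classid "1:$num" htb rate "$rate"
tc filter add dev peth0 protocol ip parent 1:0 prio 1 handle "$num" fw flowid "1:$num"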
[39] In HVM mode, the emulated QEMU devices are something of a risk, which is part of why we don't offer HVM domains.
[40] In this case, we're talking about bandwidth monitoring. You should also run some sort of IDS, such as Snort, to watch for outgoing abuse (we do) but there's nothing Xen-specific about that.
[41] We are using the words port and interface interchangeably here. This is a reasonable simplification in the context of interface counters on an SNMP-capable switch.
Storage in a Shared Hosting Environment

As with so much else in system administration, a bit of planning can save a lot of trouble. Figure out beforehand where you're going to store pristine filesystem images, where configuration files go, and where customer data will live.
For pristine images, there are a lot of conventions-some people use /diskimages, some use /opt/xen, /var/xen, or similar, some use a subdirectory of /home. Pick one and stick with it.
Configuration files should, without exception, go in /etc/xen. If you don't give xm create a full path, it'll look for the file in /etc/xen. Don't disappoint it.
As for customer data, we recommend that serious hosting providers use LVM. This allows greater flexibility and manageability than blktap-mapped files while maintaining good performance. Chapter 4 covers the details of working with LVM (or at least enough to get started), as well as many other available storage options and their advantages. Here we're confining ourselves to lessons that we've learned from our adventures in shared hosting.
Regulating Disk Access with ionice

One common problem with VPS hosting is that customers-or your own housekeeping processes, like backups-will use enough I/O bandwidth to slow down everyone on the machine. Furthermore, I/O isn't really affected by the scheduler tweaks discussed earlier. A domain can request data, hand off the CPU, and save its credits until it's notified of the data's arrival.
Although you can't set hard limits on disk access rates as you can with the network QoS, you can use the ionice command to prioritize the different domains into subclasses, with a syntax like:

# ionice -p <PID> -c <class> -n <priority>

Here -n is the knob you'll ordinarily want to twiddle. It can range from 0 to 7, with lower numbers taking precedence.
We recommend always specifying 2 for the class. Other classes exist-3 is idle and 1 is realtime-but idle is extremely conservative, while realtime is so aggressive as to have a good chance of locking up the system. The within-class priority is aimed at proportional allocation, and is thus much more likely to be what you want.
Let's look at ionice in action. Here we'll test ionice with two different domains, one with the highest normal priority, the other with the lowest.
First, ionice only works with the CFQ I/O scheduler. To check that you're using the CFQ scheduler, run this command in the dom0:

# cat /sys/block/[sh]d[a-z]*/queue/scheduler
noop anticipatory deadline [cfq]
noop anticipatory deadline [cfq]
The word in brackets is the selected scheduler. If it's not [cfq], reboot with the parameter elevator=cfq.
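On most kernels you can also switch a single disk's scheduler on the fly rather than rebooting; for example (using sda as a stand-in for your actual device):

# echo cfq > /sys/block/sda/queue/scheduler

Rereading the scheduler file should then show [cfq] in brackets.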
Next we find the processes we want to ionice. Because we're using tap:aio devices in this example, the dom0 process is tapdisk. If we were using phy: devices, it'd be [xvd ].

# ps aux | grep tapdisk
root  1054  0.5  0.0  13588  556 ?  Sl  05:45  0:10  tapdisk /dev/xen/tapctrlwrite1 /dev/xen/tapctrlread1
root  1172  0.6  0.0  13592  560 ?  Sl  05:45  0:10  tapdisk /dev/xen/tapctrlwrite2 /dev/xen/tapctrlread2

Now we can ionice our domains. Note that the numbers of the tapctrl devices correspond to the order the domains were started in, not the domain ID.
# ionice -p 1054 -c 2 -n 7
# ionice -p 1172 -c 2 -n 0

To test ionice, let's run a couple of Bonnie++ processes and time them. (After Bonnie++ finishes, we dd a load file, just to make sure that conditions for the other domain remain unchanged.)

prio7domU tmp # /usr/bin/time -v bonnie++ -u 1 && dd if=/dev/urandom of=load
prio0domU tmp # /usr/bin/time -v bonnie++ -u 1 && dd if=/dev/urandom of=load

In the end, according to the wall clock, the domU with priority 0 took 3:32.33 to finish, while the priority 7 domU needed 5:07.98. As you can see, the ionice priorities provide an effective way to do proportional I/O allocation.
The best way to apply ionice is probably to look at CPU allocations and convert them into priority classes. Domains with the highest CPU allocation get priority 1, next highest priority 2, and so on. Processes in the dom0 should be ioniced as appropriate. This will ensure a reasonable priority, but not allow big domUs to take over the entirety of the I/O bandwidth.
Backing Up DomUs

As a service provider, one rapidly learns that customers don't do their own backups. When a disk fails (not if-when), customers will expect you to have complete backups of their data, and they'll be very sad if you don't. So let's talk about backups.
Of course, you already have a good idea how to back up physical machines. There are two aspects to backing up Xen domains: First, there's the domain's virtual disk, which we want to back up just as we would a real machine's disk. Second, there's the domain's running state, which can be saved and restored from the dom0. Ordinarily, our use of backup refers purely to the disk, as it would with physical machines, but with the advantage that we can use domain snapshots to pause the domain long enough to get a clean disk image.
We use xm save and LVM snapshots to back up both the domain's storage and running state. LVM snapshots aren't a good way of implementing full copy-on-write because they handle the "out of snapshot space" case poorly, but they're excellent if you want to preserve a filesystem state long enough to make a consistent backup.
Our implementation copies the entire disk image using either a plain cp (in the case of file-backed domUs) or dd (for phy: devices). This is because we very much want to avoid mounting a possibly unclean filesystem in the dom0, which can cause the entire machine to panic. Besides, if we do a raw device backup, domU administrators will be able to use filesystems (such as ZFS on an OpenSolaris domU) that the dom0 cannot read.
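For a single LVM-backed domU, the whole procedure boils down to something like the following sketch; the domain name, the volume group (verona), the snapshot size, and the paths are examples rather than anything prescribed by Xen:

# mkdir -p /var/backup/xen/horatio
# xm save horatio /var/backup/xen/horatio/horatio.save
# lvcreate -s -L 2G -n horatio_snap /dev/verona/horatio
# xm restore /var/backup/xen/horatio/horatio.save
# dd if=/dev/verona/horatio_snap of=/var/backup/xen/horatio/horatio.img bs=1M
# lvremove -f /dev/verona/horatio_snap

The domain is only down for as long as the save and the snapshot take; the slow dd runs against the snapshot while the domU is already back in service.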
An appropriate script to do as we've described might be:

#!/usr/bin/perl
my (@disks, @stores, @files, @lvs);
$domain = $ARGV[0];

my $destdir = "/var/backup/xen/${domain}/";
system "mkdir -p $destdir";

open(FILE, "/etc/xen/$domain");
while (<FILE>) {
    if (m/^disk/) {
        s/.*\[\s+([^\]]+)\s*\].*/\1/;
        @disks = split(/[,]/);
        # discard elements without a :, since they can't be
        # backing store specifiers
        while ($disks[$n]) {
            $disks[$n] =~ s/['"]//g;
            push(@stores, "$disks[$n]") if ("$disks[$n]" =~ m/:/);
            $n++;
        }
        $n = 0;

        # split on : and take only the last field if the first
        # is a recognized device specifier.
        while ($stores[$n]) {
            @tmp = split(/:/, $stores[$n]);
            if (($tmp[0] =~ m/file/i) || ($tmp[0] =~ m/tap/i)) {
                push(@files, $tmp[$#tmp]);
            }
            elsif ($tmp[0] =~ m/phy/i) {
                push(@lvs, $tmp[$#tmp]);
            }
            $n++;
        }
    }
}
close FILE;