Michael Renner's macro-blog

PostgreSQL talks in June

2010-04-30T16:36:35Z

Here's a short overview of my speaking engagements in June.

AMOOCON

AMOOCON takes place from 4th to 6th June in Rostock, Germany. The conference has its roots in the FOSS VoIP communities but has a broader focus these days. I will hold a "full length" PostgreSQL advocacy talk as well as a 20 minute PostgreSQL 9.0 primer.

Netways OSDC

It's my second time at the OSDC, again in Nuremberg, Germany on 23rd & 24th of June. I'll hold a talk and give a short presentation on the native replication mechanisms PostgreSQL is going to provide with the upcoming 9.0 release.

As every year, there might be talks & presentations on these topics at the Metalab shortly before or after the two events; follow me on twitter or watch the event list for updates if you're interested in these.

Amsterdam, Volcanoes, Transport in Europe, Conferences, Projects and probably even more

2010-04-18T22:06:54Z

So it's April by now, more than half a year since my last post. Maybe this blogging business was just a temporary thing after all. I guess my reluctance to post anything new was caused by the lack of definitives in my life lately, but this will hopefully change in the near future.

At the moment I'm sitting in Amsterdam, grounded since Saturday evening by the Icelandic volcano business (which is 80% FUD and 20% facts by my guess), juggling the options of flying home to Vienna tomorrow (if they decide to open shop again and I can actually get a ticket) or just staying here till Thursday for my second round of PowerDNS trainings. I already ruled out going by car (one-way Amsterdam -> Vienna is far too expensive for my taste, and I won't be doing ~22 hours of driving in 4 days).

Going by train across Europe is still a complete no-go for the spoiled Web generation I'd call myself a member of. Neither oebb.at nor ns.nl lists fares and seat availability; nshispeed.nl doesn't even know Austria! Deutschebahn.com at least had pricing and seat (un)availability information for the trains from Frankfurt to Vienna. This information, combined with the first-hand experience of a friend of mine ("The terror! The terror I've seen!"), suggested that the train was not the way to go.

So right now I'm wearing a borrowed Slipknot t-shirt, waiting for my laundry to finish and cancelling the appointments for the coming week.

Definitives, possibilities and maybes

I quit my job at Geizhals back in November because of cultural differences too large to bridge. Looking at my professional experience so far and the Austrian (especially Viennese) job market my best guess was starting to freelance. The support of the AMS in this regard and the aids for formation of a company in Austria by the Wirtschaftskammer made this step much easier than I had anticipated.

In the meantime I also got a few job offers which I'm looking into, the most notable being from the big G. I'll be over in Zurich at the beginning of May, so stay tuned for further updates.

There's also stuff happening in the Viennese Web20 area and on the PowerDNS front, and lately I've also been pondering startup ideas with Lukas Fittl.

Oh well, I hope I've got a sound plan by no later than the end of the year ;).

Conferences

Conferences! I'm confirmed for AMOOCON and still pretty unconfirmed for the Netways OSDC. All talks are about PostgreSQL; see the conference pages for the exact focus.

Sideprojects

On the sideproject front I can welcome

Titanpad

Titanpad was launched as an EtherPad replacement, because the latter stopped allowing new pads to be created on the 14th of April and will close shop on the 14th of May.

etherhack

etherhack is a project of a friend of mine who wants to bring Linux to generic (somewhat-)managed switches. The research is still at the beginning; if you're into hardware hacking, writing toolchains and all that stuff, it's definitely something to look into.

FM4

The ever-present unofficial FM4 stream is currently on hiatus. We'll see if we can find a new home for it.

PostgreSQL Benchfarm

I dabbled a bit in object-oriented Perl with Moose, trying to build an automated and customizable benchmarking framework for PostgreSQL. It was quite a trip, but I've put the project on hold for now because I'm all out of time and focus ;).

That's all from me for now, I'm off to bed and hunting flights tomorrow.

PostgreSQL Performance Slides, Solaris stuff

2009-09-23T18:10:00Z

As promised, here are the slides of my presentation I held at the Metalab, titled "PostgreSQL Performance: Eine Landvermessung".

And I just stumbled over this blog posting by Brendan Gregg who works for Sun's Fishworks team and was amazed by the level of detail that Solaris' instrumentation data provides. Good stuff!

FrOSCon 2009

2009-08-31T21:16:22Z

FrOSCon

FrOSCon 2009 was a nice break from the stress at work, replacing it with stress at the weekend. The atmosphere was nice as usual and the planning as good as every year. And with Andreas Scherbaum playing airport and venue taxi, the transportation didn't leave any room for improvement ;).

A few of the things that stuck with me were

Virtualization & Cloud Management

There's a lot of stuff going on in the Virtualization world, since by now everybody has noticed that just hypervising things doesn't cut it and that you need to manage the stuff you deployed somehow. Which is a good thing, by the way.

OpenQRM gets Puppet support, finally beating the foil ball. Eucalyptus could emerge as a strong player in the IaaS area (Infrastructure as a Service) especially with the Walrus storage service. And there's a plethora of other projects like OpenNebula, Nimbus, Aspen, Enomaly and Reservoir, which also might have their respective strong points.

But all of them have one thing in common:

They are hardly usable in production environments.

If you need finished products stick to your VMware for now. If you go for one of the FOSS products in a large environment be prepared to hit a few speedbumps along your way and hack lots of essential stuff by yourself.

Apache Hadoop & Mahout

Isabel Drost talked about Mahout, which is a project focused on machine learning, extending from Lucene, doing its magic with Hadoop. Again very abstract, and I get the impression that MapReduce-based stuff still isn't quite ready for the unwashed masses.

Side notes: The Apache people start to get scary by now, I bet they're about to start an operating system project very soon now. And both Hadoop and Mahout have Elephant logos!

PostgreSQL & Performance

My presentation was haunted by far too little time to prepare, an experience-wise very diverse audience and far too much content. For the future, I'll pick a skill level in advance and stick to it. And do timings upfront. Promised. ;)

Perl::Critic

René Bäcker gave a talk on Perl::Critic, an interesting module to enforce coding standards in Perl projects (You wouldn't believe it's not an oxymoron!). Most of its rule set is based on Damian Conway's Perl Best Practices, so it gives you a good head start for maintainable code. It's very easy to extend, so enforcing all your major and minor pet peeves isn't much of a problem.

PostgreSQL (in the real world)

Stefan Kaltenbrunner gave us insights into what the PostgreSQL project infrastructure looks like and what role Panama plays in the big picture ;).

In the afternoon we had a few lightning talks in the PostgreSQL Developer Room. Two excerpts:

I talked a bit about Geizhals.at and how we use PostgreSQL over here

Marek Swierzy from OSSCAD GmbH described how they use PostgreSQL to store temperature readings (among other measurements) which they collect from fiber optic cables over distances of up to 12 km with a resolution of 0.5 m, with just the cable as sensor!

See the PostgreSQL Wiki for a complete list of talks.

OpenSQL Camp Database Panel Discussion

More an interactive question time than a panel discussion but interesting nevertheless. Now I finally know what the main market of Firebird is (Embedded database engine for applications). And Blackray seems to be an interesting contender on the FTS market, when you've got enough RAM to throw at the problem.

Summing it up

All in all it was a nice conference with very much content and far too many parallel tracks ;).

Würstel Queue

An interim update

2009-08-16T19:13:18Z

The last two months were very interesting and positively demanding.

Geizhals

I took up a regular (and interesting!) job again at Geizhals (a price comparison platform with origins in Austria), which seems to be one of the few interesting web projects in Austria. The job title says "Head of IT services", at the moment I'm interviewing candidates for the newly created sysadmin team as well as testing components¹ for a future-proof platform to run all Geizhals services. The project has come a long way in the few years I didn't follow it closely and the service landscape got quite a bit more complex in the meanwhile.

¹ Currently at the foundation: HP, Supermicro, Debian, Xen, DRBD, one of the projects formerly known as Heartbeat, Puppet, etc.

Nehalem & HP ProLiant G6

With the official presentation of the Nehalem architecture HP also launched their ProLiant Generation 6. The QuickPath interconnect was long overdue and is a stab in the heart of AMD's meticulously built up foothold in the server market. It'll be interesting to see how the vendors who switched to AMD in the last few years² will plan their future strategy.

As for HP servers - we've currently got two DL360 G6, E5530, 72GB RAM, 8× 300GB SAS machines in the office.

Things were a bit bumpy at the start (quite a few "must-have" firmware updates), but this is to be expected when a new CPU architecture and chipset are launched. I'll probably follow up as soon as the servers are in production.

A few things I noticed when comparing DL360 G6 to G5:

The servers:

are more quiet
use less power
take ages to POST
are flaky in heavy reboot/test cycles, especially when using virtual media and broken boot loaders
are otherwise what you'd expect from properly engineered Tier 1 servers

² The main reason being Opteron/HyperTransport, because Intel's FSB-architecture didn't scale nicely to more than a few processors

HP & Debian

On the plus side the Debian support for the server tools is much better these days. The effort (apparently spearheaded by Dann Frazier) resulted in an apt-cdrom readable ISO image which is going to be replaced by a proper Debian repository³ eventually.

If you run ProLiant x86 servers the hp-health tools are very nice to have and if you're using SmartArray controllers you'll be delighted by the properly packaged hpacucli.

³ I've set up http://amd.co.at/hpstuff/ in the meanwhile. Use "deb http://amd.co.at/hpstuff lenny/8.25 non-free" as sources.list entry in your Debian Lenny systems.

Debian & Bootloaders

During the testing & deployment of the new HP servers I stumbled over a few gotchas in Debian Lenny.

Installing an LVM-root based system with no standalone /boot partition is apparently unsupported (it results in an unbootable system); if you go with Debian's defaults for LVM-root installations you get a system with LILO.

Nothing a manual copying of the /boot files and installing of GRUB 2 can't fix.

But as soon as you try using GRUB 2 on a Xen Dom0 you notice that the grub.cfg generator doesn't support generating Xen compatible config stanzas.

I'll file a few bug/discussion reports in the near future as soon as I've got all the details worked out.

PostgreSQL 8.4

In the meanwhile PostgreSQL 8.4 was also released. No major breakthroughs, but lots of small functionality and performance improvements. See the announcement and the Release notes for details.

The thing I'm interested in the most is the introduction of fadvise calls which make asynchronous kernel-side multithreaded IO-prefetching possible. This is very helpful in situations where index scans of a single backend hit disk and your storage backend can handle more (read) IOs than a single thread can generate on its own. Expect more on this topic in the near future.
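To make the idea a bit more tangible, here's a minimal Python sketch of the kind of hint involved. This is not PostgreSQL's code (that lives in C inside the backend); the helper names prefetch_blocks, read_blocks and demo are made up for illustration, and the sketch assumes a platform where Python exposes os.posix_fadvise and os.pread (Linux, for example). The 8192 bytes are PostgreSQL's default block size.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def prefetch_blocks(fd, offsets, block_size=BLOCK_SIZE):
    """Tell the kernel which blocks we are about to read.

    The call returns immediately; the kernel may start fetching the
    blocks in the background. Returns the number of hints issued.
    """
    if not hasattr(os, "posix_fadvise"):
        return 0  # platform without posix_fadvise: silently skip the hints
    for offset in offsets:
        os.posix_fadvise(fd, offset, block_size, os.POSIX_FADV_WILLNEED)
    return len(offsets)


def read_blocks(fd, offsets, block_size=BLOCK_SIZE):
    """Read the blocks one after another, as a single process would."""
    return [os.pread(fd, block_size, offset) for offset in offsets]


def demo():
    """Write three blocks to a scratch file, hint two of them, read them."""
    with tempfile.NamedTemporaryFile() as scratch:
        for fill in (b"a", b"b", b"c"):
            scratch.write(fill * BLOCK_SIZE)
        scratch.flush()
        wanted = [0, 2 * BLOCK_SIZE]
        prefetch_blocks(scratch.fileno(), wanted)
        return read_blocks(scratch.fileno(), wanted)
```

The point of the pattern is that all hints go out before the first blocking read, so a storage backend with many spindles gets more than one request to chew on at a time.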

Performance

At work I had the chance to replace an 8-drive SATA SAN with a 32-drive SAS SAN along with a server replacement (Intel Core2/FSB Xeons -> HyperTransport Opterons). Benchmarking these things was quite fun and enlightening, but I had far too little time to properly document everything. One thing I learned is that ample amounts of write cache and proper command queuing depth go a long way in storage systems ;). Still no solid state devices though.

PostgreSQL talk

I'll give a talk on PostgreSQL Performance at FrOSCon. Still not finished, but it's targeted at beginners (when it comes to performance-related topics) and not entirely PostgreSQL specific. I'll probably repeat it at the Metalab, if there should be enough interest.

SICEKIT

Christian Hofstädtler and I started to generalize the infrastructure documentation framework we started back at Inqnet. The current progress can be seen at http://sicekit.org/; it will probably take a few months till we've got a usable product though.

Data-sniffing trojans burrow into Eastern European ATMs

2009-06-04T17:35:44Z

A catchy headline, as written by The Register. To quote more from the story (Full report with tech details):

The malware logs the magnetic-stripe data and personal identification number of cards used at an infected machine and provides an intuitive interface for retrieving the information using the ATM's receipt printer, [..] Since late 2007 or so, there have been at least 16 updates to the software, an indication that the authors are working hard to perfect their tool.

This is a nice example of what happens when you ignore the things that are necessary to run an important area of your core business. The business area being the operation of the ATMs (guess what bank teller utilization would look like if you threw out all ATMs). And a few of the things needed to run such a part competently would be: security (of the systems, the network), service lifecycle management and configuration management.

To put the situation in perspective:

We've got a network of Windows 98/2000/XP devices, supplied by an ISV, hardly maintained, with the ISV having a proven track record of "being challenged" WRT IT security, running on a scarcely secured network¹ which deals with cash transactions. Is there any reason to worry?

Yes. Deploying machines in such sensitive environments, without having a plan on how one is going to deploy updates, without having a plan on how you're going to spot tamperings, without having a plan on assessing how the security of the system looks like is blatantly incompetent. I can see the guys in charge, stating "Oh, we don't need this, it's an internal network. Nobody is going to have access there!" in meetings when the discussion touches one of the aforementioned topics. And you can bet on the corporate culture of banks to honor such reasonings. Until it's too late.

And the sad part is that the whole system is most likely in such a bad shape that a proper approach to the situation would take at least months (or weeks, given the availability of domain experts and allowing for outages in production systems). And so they will do what they always do when faced with a problem which escapes the scopes they're fit for: fix the symptoms! There's probably a sorry lad driving around the country right now, checking every ATM and deleting the trojan if it's installed. And maybe, only maybe, also fixing up the holes the crooks used to get in in the first place.

The interesting part of the story is the amount of professionalism shown by the bad boys. You can rely on the powers of the market economy, the finesse and level of competence of the Russian IT-crooks and a software/infrastructure ecosystem which almost screams for being abused to lead to the exact situation at hand.

And I'm glad it happened. Now the companies involved will run through their stages of grief, probably skipping a phase or two, emerging reinforced. A pretty popular case of publicly displayed corporate griefing would be the timeline of the Mifare Classic security problems. Back then it was basically "There are no security issues, filthy liars!", being followed by "I baked you an injunction, but I failed it" which finally resulted in a "Mifare Plus is an AES-based drop-in replacement for Mifare Classic and will be available later this year".

The sad part in both cases is, that it always takes an event of such gigantic proportions to get the affected companies moving and accept/adopt best practices from the industry.

Proper system administration practices have existed since the first "hosts" started to run batch jobs (and were much better back then, as I'm told by IT veterans). And they're even poured into ITIL these days.

Cryptanalysis has existed since mankind started to hide messages from each other and was very much professionalized in WWII, making it possible for the Allies to tilt the odds in their favor. And in the case of Crypto-1 it doesn't even take a domain expert to get suspicious. It was a sound obfuscation solution back in '94, but product management should've acted on it in the last 14 years, especially because Mifare Classic started to get used heavily in electronic access control systems in offices and governmental departments (can't say much about Police or Military, one hopes that they've got better standards there).

In the end, the world will have safer & better systems. Maybe better educated vendors. All at the expense of much stress, pain, fingerpointing and shouting. And all that could've been much easier if the people in question had the room/balls/brains for actually questioning what they're doing, and if It's after all - Good For The Company?

¹ I had a picture of an ATM in a bank foyer, supposedly somewhere in Eastern Europe, showing networking equipment and an abundance of cabling right next to it, for everyone to access. But I lost it somewhere on the internet ;).

In defense of architecture diagrams

2009-05-31T20:27:20Z

I just stumbled over an old architecture diagram from one of the projects I used to work on. The type of services and project in question are left as an exercise to the curious reader, since this is not the point of this posting.

What I wanted to show is, how complex multi-tiered applications can be these days, especially when you phase in new services or try to replace old ones by setting up the new services to run in parallel to the existing ones.

Imagine the following scenarios:

New team members

A member gets added to the project. How long does it take him to understand the project from the technical side? How long would it take him if he isn't familiar with the area of business or a domain expert? Chances are high that new project members will create drawings on their own to get a complete picture of the architecture.

Discussions

There's a (maybe even heated) debate over a particular area of the architecture. Nobody has a complete & clear picture of the architecture, since the last discussion is a few weeks old. How long does it take to get your point across when only resorting to a flipchart? How long will it take when you can use an accurate & legible overview as the base for your discussion?

Operations

Your ops team gets alerted because some part of your project's infrastructure misbehaves. How much time is going to be spent finding the source of the problem when there's no overview of the project, while trying to figure out which symptoms are causal for the problem and which are just side effects?

Summing it up

Even if you don't need that kind of documentation right now, chances are high that you're going to need it very soon. And if you don't do it nice'n'thorough once, you (or other team members) will repeat the effort multiple times and throw the results away after they're done with them.

So in the name of efficiency, get out the Visio (or Dia, or OmniGraffle...) and draw away!

System Administrator centric online community launched

2009-05-29T11:02:20Z

To quote Jeff Atwood in his blog:

Server Fault is a sister site to Stack Overflow, which we launched back in September 2008. It uses the same engine, but it's not just for programmers any more:

Server Fault is for system administrators and IT professionals, people who manage or maintain computers in a professional capacity. If you are in charge of ...
* servers
* networks
* many desktop PCs (other than your own)
... then you're in the right place to ask your question! Well, as long as the question is about your servers, your networks, or desktops you support, anyway.

I'm really delighted to see this. I liked what Jeff and his friends did with Stack Overflow and always thought that the System Administrators lacked a sensible and well-visited forum of some sorts.

With software developers there are various boards, groups, etc. (albeit mostly language/framework-specific) where one can get sane and considerate suggestions from people who know their box and can think outside of it.

But for system administrators no such generic & popular places existed (Maybe some Usenet groups and probably some areas in the wake of LISA/USENIX, but those are as well-established in Old Europe as Monster Trucks and WWF wrestling).

One of the main challenges System Administrators face is that, compared to most developers, who might work in a single language/framework on a single product for weeks or months, sysadmins are (depending on the environment) tasked with a very broad area of responsibilities and topics.

At the bare minimum every site should have:

Backup
Restore (think: Disaster Recovery)
Monitoring
Performance data collection
Documentation
Virtualization (by now!)
Patch/Update management
Configuration Management (if the amount of nodes warrants it)
Defined & communicated availability information for the system

Excluding any services which are going to be run on the infrastructure, you need a good understanding of products from at least 7 different vendors to set up & maintain this infrastructure. And may god help you if you need to design your infrastructure upfront with products you don't know yet. Especially when they're open source products¹.

And this is where Server Fault comes to the rescue.

You're looking for a backup solution and want to check upfront if Bacula or Amanda are any good or if you should go for the commercial offerings? Heck, you might even want to know about different approaches to short-term backups, like NetApp Snapshots?

You're relatively new to the Virtualization bandwagon and want to know what the production-relevant impediments and features of Xen, KVM, OpenVZ/Virtuozzo and VMware are?

Those are a few examples of things one can learn through many years in System Administration, in the right environment with the right sort of colleagues.

And this process can be shortened considerably when you've got the right sort of forum, where interested persons can mingle with experienced ones and where even controversial topics (Container-based or Full Virtualization? I dare you!) can be discussed in a civilized manner.

So let's see how this develops, I'll be trolling the site in the meanwhile ;).

¹ As the infrastructure/installation gets larger, proper integration of all tools becomes more and more important. You don't want to find out that your tool doesn't have proper AAA integration for central identity management. You don't want to hack up your own monitoring interfaces, going directly into the product's native database because the vendor didn't really anticipate that you want automatic monitoring of your job runs. Those are expected features when a given tool handles more nodes than you can count with all your limbs.

Testing PostgreSQL replication solutions: Slony-I

2009-05-16T00:00:23Z

Slony-I is a trigger-based replication solution which allows you to replicate database tables and sequences asynchronously from one master to several read-only slaves (which can also be cascaded).

Trigger-based means that each table and sequence which gets replicated has triggers assigned, which fire whenever the content of the given database object changes. The stored procedures associated with the triggers then record the changes and store them in a replication log table. Separate daemons monitor the log table for changes and distribute them according to their defined rules.
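If the mechanism sounds abstract, here's a toy model of it in Python, using SQLite's triggers instead of PostgreSQL's so that it runs without any setup. Everything in it is an assumption made for illustration: the account table, the repl_log layout and the replicate function are invented and look nothing like Slony-I's real log tables or the slon daemon; only the principle (trigger writes a log row, a separate process replays the log elsewhere) is the same.

```python
import sqlite3


def make_origin():
    """Origin database: one replicated table plus a trigger-fed change log."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT);
        -- hypothetical log table, not Slony-I's real layout
        CREATE TABLE repl_log (
            seq    INTEGER PRIMARY KEY AUTOINCREMENT,
            action TEXT,
            row_id INTEGER,
            name   TEXT
        );
        CREATE TRIGGER account_ins AFTER INSERT ON account BEGIN
            INSERT INTO repl_log (action, row_id, name)
            VALUES ('I', NEW.id, NEW.name);
        END;
        CREATE TRIGGER account_upd AFTER UPDATE ON account BEGIN
            INSERT INTO repl_log (action, row_id, name)
            VALUES ('U', NEW.id, NEW.name);
        END;
        CREATE TRIGGER account_del AFTER DELETE ON account BEGIN
            INSERT INTO repl_log (action, row_id, name)
            VALUES ('D', OLD.id, NULL);
        END;
    """)
    return db


def make_subscriber():
    """Subscriber database: same table, no triggers, no log."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT)")
    return db


def replicate(origin, subscriber, last_seq):
    """Replay log entries newer than last_seq; return the new position."""
    rows = origin.execute(
        "SELECT seq, action, row_id, name FROM repl_log "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, action, row_id, name in rows:
        if action == "I":
            subscriber.execute(
                "INSERT INTO account (id, name) VALUES (?, ?)", (row_id, name))
        elif action == "U":
            subscriber.execute(
                "UPDATE account SET name = ? WHERE id = ?", (name, row_id))
        elif action == "D":
            subscriber.execute("DELETE FROM account WHERE id = ?", (row_id,))
        last_seq = seq
    subscriber.commit()
    return last_seq
```

Calling replicate in a loop with the position it returned last time is the (very) poor man's version of what the daemons do: the subscriber trails the origin by however long it takes to get around to the next call, which is why this style of replication is asynchronous.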

This approach allows for extremely flexible setups, having different master servers for different tables, but this comes at a price.

First - this kind of replication solution is very complex. There are triggers, stored procedures and very much meta-information (think "What has to get sent where?") in the database, with separate daemons doing the actual work.

Furthermore, dealing with the triggers also necessitates strict rules when it comes to DDL changes. The Slony-I documentation has further information on this topic.

And last but not least, the double write of every change ("in place" and in the logging table) also causes overhead for writes, approximately 2.5 times the data you'd have when not using Slony-I (Numeric and Date/Time values are much larger in the log table, since they only get stored in their ASCII representation there).
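To get a feeling for why the text form is so much bulkier, here's a small Python sketch comparing one timestamp in an 8-byte binary encoding (the size PostgreSQL uses for timestamp values on disk) with its ASCII form. The function name timestamp_sizes is made up, and the text form is produced by Python's isoformat, which only approximates what PostgreSQL prints; treat the numbers as an illustration, not as a measurement of Slony-I.

```python
import struct
from datetime import datetime, timedelta, timezone


def timestamp_sizes(ts):
    """Return (binary_bytes, text_bytes) for one timezone-aware timestamp."""
    # Binary form: microseconds since 2000-01-01 packed into 8 bytes.
    epoch = datetime(2000, 1, 1, tzinfo=timezone.utc)
    micros = (ts - epoch) // timedelta(microseconds=1)
    binary = struct.pack("!q", micros)
    # Text form: what ends up in a log table that only stores ASCII.
    text = ts.isoformat(sep=" ").encode("ascii")
    return len(binary), len(text)
```

For datetime(2009, 5, 16, 0, 16, 19, tzinfo=timezone.utc) this returns (8, 25), so the text form of that single value is about three times the size of the binary one.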

See also the Slony-I introduction on their site.

That being said, let's see how this works:

Under the hood

Slony-I components

There are a few things that make Slony-I tick:

PostgreSQL

Since most of the interesting things happen inside PostgreSQL in the form of triggers and stored procedures, Slony-I can naturally not work without PostgreSQL ;).

All Slony-I related information (nodes, replication sets, log entries, etc.) is stored in a schema called "_$SLONYCLUSTERNAME".

slon

slon is the daemon which takes care of the actual data replication, monitoring the Slony-I log tables and applying the changes to the various nodes.

slon_tools.conf

The "shape" of the cluster should be accurately documented in slon_tools.conf. Many Slony-I helper scripts use the information in the slon_tools.conf to generate the necessary slonik commands.

slonik

slonik is the Slony-I command processor, parsing slonik commands and calling stored procedures on the various nodes to reflect the desired changes.

Please also read Slony-I Concepts to understand the terms I'm going to use from now on ;).

The pgexerciser schema

Since using Slony-I requires a good understanding of the schema your application uses, I'll explain how pgexerciser does its magic. pgexerciser tries to implement an overly trivialized auction application. There are users, who can create auctions and bid on auctions. Every bid is "sanity checked" in the database.

user

 Column |  Type   |                     Modifiers
--------+---------+---------------------------------------------------
 id     | integer | not null default nextval('user_id_seq'::regclass)
 name   | text    |

Boring table, two columns, one Primary Key doubling as the user id, one for usernames.

auction

   Column    |           Type           |                      Modifiers
-------------+--------------------------+------------------------------------------------------
 id          | integer                  | not null default nextval('auction_id_seq'::regclass)
 creator     | integer                  | not null
 description | text                     | not null
 current_bid | numeric                  | not null default 0
 end_time    | timestamp with time zone | not null default now()
Indexes:
    "auction_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "auction_creator_fkey" FOREIGN KEY (creator) REFERENCES "user"(id)

Primary Key as auction id, the auction's creator (foreign key constraint on user table), auction description, current highest bid (updated via a trigger on the bid table) and the auction's end time.

bid

 Column  |           Type           |                    Modifiers
---------+--------------------------+--------------------------------------------------
 id      | integer                  | not null default nextval('bid_id_seq'::regclass)
 bidder  | integer                  | not null
 auction | integer                  | not null
 bid     | numeric                  | not null
 time    | timestamp with time zone | not null default now()
Indexes:
    "bid_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "bid_auction_fkey" FOREIGN KEY (auction) REFERENCES auction(id) ON DELETE CASCADE
    "bid_bidder_fkey" FOREIGN KEY (bidder) REFERENCES "user"(id)
Triggers:
    update_auction_current_bid BEFORE INSERT OR UPDATE ON bid FOR EACH ROW EXECUTE PROCEDURE update_auction_current_bid()

Primary Key as bid id, the bidder (FK constraint on user table), the auction id (FK on auction table), the bid amount and a timestamp.

There's a trigger which validates every bid (checks if the new bid is higher than the current highest bid and if the auction hasn't ended already) and if it's valid, updates the current_bid in the auction table.
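The rule the trigger enforces can be sketched in a few lines of Python. Mind that this is only a model of the logic described above: the real check is the PostgreSQL trigger function update_auction_current_bid(), whose source isn't shown here, and the names Auction, InvalidBid and place_bid are invented for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal


@dataclass
class Auction:
    """The two auction columns the bid check cares about."""
    current_bid: Decimal
    end_time: datetime


class InvalidBid(Exception):
    """Raised when a bid fails the sanity check."""


def place_bid(auction, amount, now):
    """Validate a bid and, if it is valid, record it as the new highest bid."""
    if now >= auction.end_time:
        raise InvalidBid("auction has already ended")
    if amount <= auction.current_bid:
        raise InvalidBid("bid is not higher than the current highest bid")
    auction.current_bid = amount
    return auction
```

Whether a bid equal to the current highest bid or a bid placed exactly at end_time counts as invalid is an assumption of the sketch; the description above only says "higher" and "hasn't ended already".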

Getting started

As always, please make sure that your environment looks as described in this post.

Preparing the environment

As a first step, run

master1:~/pgworkshop# ./envorcer slony

This will

create a PostgreSQL superuser called "slony" on both nodes
disable all access constraints on all databases network-wise
create a slon_tools.conf prepared for the pgexerciser schema
copy the pgexerciser schema to the "slave node"
add startup entries for the slon daemons on master1.

The slon_tools.conf

The slon_tools.conf is not necessary for normal operation of a Slony-I cluster, it's just a reference for the altperl Scripts which we will use for cluster administration.

There's little documentation for the config file itself, but it's heavily commented.

/etc/slony1/slon_tools.conf contains the version edited for our schema, /usr/share/doc/slony1-bin/examples/slon_tools.conf-sample.gz is the original file as supplied by Slony-I, which contains more comments.

slonik et al

I won't go into much detail about slonik and the commands it expects - the userland tools we use (mostly) do what they're supposed to do, so there's no need to dive into this right now. See the Slony-I command reference for more information about the slonik commands.

Bootstrapping slony

Running "slonik_init_cluster" generates the necessary slonik commands based on /etc/slony1/slon_tools.conf to initialize a Slony-I cluster, which basically means that slonik will create the special Slony-I schema on all configured nodes. You can either review the commands or just pipe the output to slonik to get started. Afterwards make sure to start the slon daemons which are necessary to actually replicate data.

master1:~/pgworkshop# slonik_init_cluster | slonik
:10: Set up replication nodes
:13: Next: configure paths for each node/origin
:16: Replication nodes prepared
:17: Please start a slon replication daemon for each node
master1:~/pgworkshop# /etc/init.d/slony1 start
Starting Slony-I daemon: 1 2.
master1:~/pgworkshop#

From now on you can monitor the actions of the slon daemons in "/var/log/slony1" on master1.

Now it's also a good time to start pgexerciser to get some movement in the database.

The Slony-I schema

I already mentioned that Slony-I stores much information related to replication in a special schema; to see what's actually in there you can use

master1:~/pgworkshop# psql sqlsim -c '\dt _slonytestcluster.'

See the Slony-I schema documentation for further information on the tables and stored procedures.

Replicating our first few tables

To start the replication of data to the other node, we need to define a replication set first.

I've prepared the set in the slon_tools.conf already, there is a set called "set1" consisting of the tables "user", "bid" and "auction". To create the replication set in the slony schema in the database, we need to run slonik_create_set:

master1:~# slonik_create_set 1 | slonik
:16: Subscription set 1 created
:17: Adding tables to the subscription set
:21: Add primary keyed table public.user
:25: Add primary keyed table public.bid
:29: Add primary keyed table public.auction
:32: Adding sequences to the subscription set
:33: All tables added
master1:~#

As always, you can check the commands slonik is going to run by omitting the piped call to the slonik interpreter.

Creating the set alone won't buy us anything though, we also need to subscribe a second node to it:

master1:~# slonik_subscribe_set 1 2 | slonik
:10: Subscribed nodes to set 1
master1:~#

In the logfile of node2 we can now see that the data is going to be copied from the master server:

[..]
2009-05-16 00:16:19 CEST DEBUG2 remoteWorkerThread_1: Received event 1,1674 ENABLE_SUBSCRIPTION
2009-05-16 00:16:19 CEST DEBUG1 copy_set 1
2009-05-16 00:16:19 CEST DEBUG1 remoteWorkerThread_1: connected to provider DB
2009-05-16 00:16:19 CEST DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."user"
2009-05-16 00:16:19 CEST DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."bid"
2009-05-16 00:16:19 CEST DEBUG2 remoteWorkerThread_1: prepare to copy table "public"."auction"
[..]

and later on that new data created by pgexerciser is periodically transferred:

2009-05-16 00:19:41 CEST DEBUG2 remoteListenThread_1: queue event 1,1840 SYNC
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1: Received event 1,1840 SYNC
2009-05-16 00:19:41 CEST DEBUG2 calc sync size - last time: 1 last length: 4012 ideal: 14 proposed size: 3
2009-05-16 00:19:41 CEST DEBUG2 remoteListenThread_1: queue event 1,1841 SYNC
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1: SYNC 1840 processing
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1: syncing set 1 with 3 table(s) from provider 1
2009-05-16 00:19:41 CEST DEBUG2  ssy_action_list length: 0
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1: current local log_status is 0
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1_1: current remote log_status = 0
2009-05-16 00:19:41 CEST DEBUG2 remoteHelperThread_1_1: 0.001 seconds delay for first row
2009-05-16 00:19:41 CEST DEBUG2 remoteHelperThread_1_1: 0.003 seconds until close cursor
2009-05-16 00:19:41 CEST DEBUG2 remoteHelperThread_1_1: inserts=3 updates=2 deletes=0
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
2009-05-16 00:19:41 CEST DEBUG2 remoteWorkerThread_1: SYNC 1840 done in 0.025 seconds
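If you want a quick feel for how much data each SYNC carries, the row counters in the log excerpt above can be added up with a few lines of Python. This is only a sketch: the function name is made up, and it assumes the "inserts=N updates=N deletes=N" format shown above, which other Slony-I versions or log levels may not produce.

```python
import re

# Matches the per-SYNC row counters as they appear in the log excerpt above.
COUNTERS = re.compile(r"inserts=(\d+) updates=(\d+) deletes=(\d+)")


def summarize_slon_log(lines):
    """Sum up the insert/update/delete counters found in slon log lines."""
    totals = {"inserts": 0, "updates": 0, "deletes": 0}
    for line in lines:
        match = COUNTERS.search(line)
        if match:
            inserts, updates, deletes = map(int, match.groups())
            totals["inserts"] += inserts
            totals["updates"] += updates
            totals["deletes"] += deletes
    return totals


# Two lines copied from the excerpt above; only the second one has counters.
sample = [
    "2009-05-16 00:19:41 CEST DEBUG2 remoteHelperThread_1_1: 0.003 seconds until close cursor",
    "2009-05-16 00:19:41 CEST DEBUG2 remoteHelperThread_1_1: inserts=3 updates=2 deletes=0",
]
print(summarize_slon_log(sample))
```

To run it against a real logfile, pass the open file object instead of the sample list.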

And when we check the slave server the data also looks good:

slave1:~# psql sqlsimslave -c "SELECT * FROM bid ORDER BY id DESC LIMIT 3"
  id  | bidder | auction |  bid   |             time
------+--------+---------+--------+-------------------------------
 2164 |     11 |      86 |   9.86 | 2009-05-16 00:34:15.510123+02
 2163 |      7 |      83 |  46.96 | 2009-05-16 00:34:15.177281+02
 2162 |     11 |      64 | 267.12 | 2009-05-16 00:34:15.16756+02
(3 rows)

slave1:~#

About SYNCs

Data between nodes is only replicated with every SYNC event. Additionally, Slony-I will introduce SYNC events periodically as a way to allow monitoring solutions to check if a node has fallen behind too much.

The Debian packaged slon will check for new data every second and introduce a SYNC event if it finds any. If there was no SYNC event for 10 seconds it will introduce a "keep-alive" SYNC.
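The rule described above can be written down as a tiny decision function. This is a toy model of the behaviour as I described it, not code from slon itself; the function name and the parameter names are invented for illustration.

```python
def next_sync_action(has_new_data, seconds_since_last_sync, keepalive_after=10):
    """Model the once-per-second check: which SYNC, if any, gets generated?"""
    if has_new_data:
        # New data was found, so a regular SYNC event is introduced.
        return "SYNC"
    if seconds_since_last_sync >= keepalive_after:
        # Nothing changed for a while: introduce a keep-alive SYNC anyway.
        return "keep-alive SYNC"
    return None


# Walk through an idle period followed by one change.
for second, changed in [(1, False), (5, False), (10, False), (1, True)]:
    print(second, changed, next_sync_action(changed, second))
```

The keep-alive branch is what allows a monitoring solution to tell "nothing to replicate" apart from "the node has fallen behind".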

Adding new objects to replication

We knowingly ignored the sequences (used for the primary keys) in our schema when defining the first replication set - a quick check on the subscriber server shows that they're troublingly low compared to the origin:

master1:~# psql -h slave1 sqlsimslave -c "SELECT nextval('bid_id_seq')"
 nextval
---------
       1
(1 row)

master1:~# psql sqlsim -c "SELECT nextval('bid_id_seq')"
 nextval
---------
    2931
(1 row)

master1:~#

Slony-I doesn't allow you to add new objects to an existing replication set; you have to define a new set and then merge it into the existing one:

master1:~# slonik_create_set 2 | slonik
:16: Subscription set 2 created
:17: Adding tables to the subscription set
:20: Adding sequences to the subscription set
:24: Add sequence public.auction_id_seq
:28: Add sequence public.bid_id_seq
:32: Add sequence public.user_id_seq
:33: All tables added
master1:~# slonik_subscribe_set 2 2 | slonik
:10: Subscribed nodes to set 2
master1:~#

And now the sequence on the slave server is also correct again:

master1:~# psql -h slave1 sqlsimslave -c "SELECT nextval('bid_id_seq')"
 nextval
---------
    3267
(1 row)

master1:~#

And to reduce the number of sets to maintain:

master1:~# slonik_merge_sets 1 1 2 | slonik
:10: Replication set 2 merged in with 1 on origin 1
master1:~#

Be sure to update the set definition in slon_tools.conf every time you modify a set!

Homework!

I think by now you've got the hang of the slonik tools.

Try to play through the following scenario:

Defining some data

Since DDL changes in Slony-I environments are not to be taken lightly, try applying the script in /root/pgworkshop/configs/slony/add_start_time.sql with slonik_execute_script.

Moving on

Node1 needs to have some maintenance downtime. Move the replication set from Node1 to Node2. Check the last bid in pgexerciser. Restart it with "./pgexerciser -h slave1 -d sqlsimslave".

Shit hits the fan

Node2/slave1 experiences a horrible case of "killall -9 postgres". Failover the replication set back to Node1. Check pgexerciser.

Rebuilding our shattered dreams

Restart PostgreSQL on slave1. Since Node2 is now in an indeterminate state as far as Slony-I is concerned, you need to rebuild it from scratch. Cheat sheet: slonik_drop_node, slonik_store_node, slonik_subscribe_set.

Final words

Slony-I is not for the faint of heart. To quote the documentation:

Thus, examples of cases where Slony-I probably won't work out well would include:

[..]
Sites where configuration changes are made in a haphazard way.
[..]

And regarding DDL changes:

Unfortunately, this nonetheless implies that the use of the DDL facility is somewhat fragile and fairly dangerous. Making DDL changes must not be done in a sloppy or cavalier manner. If your applications do not have fairly stable SQL schemas, then using Slony-I for replication is likely to be fraught with trouble and frustration.

So, test your procedures beforehand, document everything, monitor everything and be extra-sure when modifying the cluster.

Be aware that the slon daemons are as important as the PostgreSQL databases themselves, so treat them as such (especially when it comes to HA/failover).

But in the end, if you treat Slony-I nicely it's a trusty, reliable and proven solution for your asynchronous master-to-multiple-slaves replication needs.

Testing PostgreSQL replication solutions: Log shipping with walmgr

2009-05-12T17:19:14Z

As we've seen in our previous example, doing log shipping with pg_standby can be quite a hassle if you regularly take your slave servers online to use them for queries and then want to resume replication.

The guys from Skype were probably faced with exactly the same problems when they decided to write walmgr.

If you're not familiar with log shipping I strongly suggest reading the previous post first.

walmgr?!

walmgr is a tool written in Python, which eases deployment and maintenance of log shipping slaves. It provides easy one-shot-commands to create backups from running PostgreSQL servers, implements WAL-file-management (deleting files not needed anymore) and makes bringing slave-servers online for production use a breeze.

Furthermore it can also be configured to periodically sync the currently used WAL-segment. This greatly reduces the number of lost transactions when a slave server has to be brought online "as is".

I'm sold! Please tell me how this works, Jim!

A few words of warning upfront

For basic setup of the virtual machines see the first article in the series. Prepare the walmgr environment with

master1:~/pgworkshop# ./envorcer walmgr

walmgr relies on fairly extensive configuration files, pointing to all the necessary infrastructure to do its magic. Additionally, you have to take care to do all operations as the "postgres" user, since walmgr does a lot of copying around and does not enforce correct ownership of all files it touches by itself. Permission issues can be tedious to work out and walmgr isn't especially helpful in pointing out which files/directories need to be corrected.

A short word on configuration

In /root/pgworkshop/walmgr reside all necessary tools, configuration files and documentation for walmgr. Most of the parameters in wal-[master|slave].ini are self-explanatory, the puzzling ones are documented in walmgr.txt.

The whole directory including the wal-slave.ini is copied to the slave server when running the "envorcer" script. The wal-master.ini is only used on the master server and the wal-slave.ini is only used on the slave server. Because of this, they contain a bit of redundant information.

Setting up the master

Since I've already prepared all the necessary configuration, we can dive right in.

First, we need to prepare the master server for log shipping with walmgr:

postgres@master1:/root/pgworkshop/walmgr$ ./walmgr.py wal-master.ini setup

This sets archive_command, enables archive_mode in the postgresql.conf of the given cluster and creates the directory structure needed by walmgr on the slave server. You should also set archive_timeout to 60 seconds to get some segment switching in our test scenario.

Then restart the PostgreSQL cluster ("pg_ctlcluster 8.3 walmgr restart") and start the pgexerciser.

At this point PostgreSQL happily writes transactions to its WAL, whose segments get switched every 60 seconds and copied to the slave server as per the configuration in wal-master.ini.

Starting recovery

To start recovery on the slave server you just need to run:

postgres@slave1:/root/pgworkshop/walmgr$ ./walmgr.py wal-slave.ini restore data.master

It's important to explicitly specify the name of the backup (which can be listed/shown with the "listbackups" command) to have walmgr copy the backup from its archive-directory ("/srv/walmgr-data") to the $PGDATA path; if you don't specify it, walmgr will move the latest backup to $PGDATA, making that particular backup unavailable for any future recovery operations.

To see if the log shipping is working, see the "Doing some transactions" section in the previous post.

Bringing up the Slave

If you want to use the slave server to do some actual work you have to bring it online first:

postgres@slave1:/root/pgworkshop/walmgr$ ./walmgr.py wal-slave.ini boot

This will stop recovery and bring the database online, voiding the copy for further recovery/replication use.

Resuming Replication

To resume the recovery operation a simple

postgres@slave1:/root/pgworkshop/walmgr$ ./walmgr.py wal-slave.ini restore data.master

does the trick. Be aware, though, that PostgreSQL has to replay all WAL-files which have accumulated since the backup was taken. On databases with write-heavy loads this can take quite some time.

Cutting the losses

walmgr can also be daemonized to synchronize the currently active WAL-segment at periodic intervals. This reduces the amount of lost transactions from "transactions since last segment switch" to "transactions in the last $loop_delay seconds" when bringing the slave server online.

I suggest running the following command in a screen terminal:

postgres@master1:/root/pgworkshop/walmgr$ ./walmgr.py wal-master.ini syncdaemon

since walmgr won't detach from the terminal and reports what's happening on STDOUT.

If you bring the slave online when syncdaemon is running, the most recent entry in the bid table shouldn't be older than the interval configured in the config file.

Testing PostgreSQL replication solutions: Log shipping with pg_standby

2009-05-02T19:52:00Z

Log shipping?!

PostgreSQL offers support for "shipping" its WAL, the Write Ahead Log, where the changes of every transaction are recorded, to other database systems. The other database system then reads the changes from the WAL file and applies the changes to its local data store.

Log shipping has the drawback that the slave servers can't be used for queries as long as they are replicating data and cannot be put back in replication after they've been taken online. Additionally, the replication isn't very granular: PostgreSQL itself will only accept completed WAL files.

On the other hand this mechanism is very efficient and very reliable since the WAL is at the core of normal PostgreSQL operation.

The WAL

The WAL files of a PostgreSQL database can be found under $PGDATA/pg_xlog; in Debian $PGDATA is usually /var/lib/postgresql//. Every WAL segment is 16MiB in size (compile-time default) and its name consists of three separate counters:

Naming

If we take the name "00000001000000030000008E" it tells us that the timeline of the file is "1", that it belongs to the logical log file (logid) "3" and that it's the 142nd (0x8E) segment of the given logfile.

The segment counter increments with every segment switch; the logical log file is incremented (and the segment counter reset to 0) whenever a new segment would overflow the 32bit address space (or "4GiB") of a logical logfile. With a standard segment size of 16MiB this happens every 255 segments.
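The naming scheme is easy to take apart programmatically. The following sketch splits a segment name into its three counters and computes the name of the following segment; the function names are my own, and the rollover after 255 segments is taken straight from the description above, so it only applies to the 16MiB default segment size discussed here.

```python
SEGMENTS_PER_LOGFILE = 255  # per the text above, for 16MiB segments


def parse_wal_name(name):
    """Split a 24-character WAL segment name into (timeline, logid, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex characters, got %r" % name)
    timeline = int(name[0:8], 16)
    logid = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, logid, segment


def format_wal_name(timeline, logid, segment):
    """Build the file name from the three counters (8 hex digits each)."""
    return "%08X%08X%08X" % (timeline, logid, segment)


def next_wal_name(name):
    """Return the name of the segment that follows the given one."""
    timeline, logid, segment = parse_wal_name(name)
    segment += 1
    if segment >= SEGMENTS_PER_LOGFILE:
        # The logical log file is full: bump logid, reset the segment counter.
        logid += 1
        segment = 0
    return format_wal_name(timeline, logid, segment)


print(parse_wal_name("00000001000000030000008E"))
print(next_wal_name("00000001000000030000008E"))
print(next_wal_name("0000000100000003000000FE"))
```

The last call shows the rollover: segment FE is the 255th of logid 3, so the next segment is the first one of logid 4.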

Switching segments

A WAL segment gets switched when one of the following things happens:

it's full (16MiB worth of changes have been written)
archive_timeout is exceeded
pg_switch_xlog is called

Replicating

The mechanism used for reading in WAL files on a slave server is very close to the mechanism that is used when PostgreSQL recovers from an unclean shutdown:

The daemon doesn't know in what state the heap files (tables, indexes, etc.) are and therefore consults the WAL, where changes of every transaction are written to, replaying every transaction since the last CHECKPOINT.

Because the same code-infrastructure is used, the replaying of WAL files is called "recovery mode".

Shipping the files

PostgreSQL has an archive_command parameter which can be used to configure a command which gets called after every segment switch. This makes it easy to copy completed WAL segments from the master server to a remote system with various mechanisms, e.g. nfs, scp, rsync, etc.
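To illustrate what such a command has to do, here is a minimal sketch of an archive helper. The function name, the script layout and the ".tmp" suffix are assumptions of mine; the important properties are that it never overwrites a segment which is already in the archive and that it signals failure with a non-zero return value, so PostgreSQL keeps the segment and retries later.

```python
import os
import shutil
import tempfile


def archive_segment(source_path, archive_dir, file_name):
    """Copy one completed WAL segment into archive_dir.

    Returns 0 on success and 1 on failure, mirroring the exit status an
    archive_command is expected to report back to PostgreSQL.
    """
    target = os.path.join(archive_dir, file_name)
    if os.path.exists(target):
        # Never overwrite a segment that has already been archived.
        return 1
    temporary = target + ".tmp"
    try:
        # Copy under a temporary name first so readers of the archive
        # never see a half-written segment under its final name.
        shutil.copyfile(source_path, temporary)
        os.rename(temporary, target)
    except OSError:
        return 1
    return 0


# Small self-contained demo using throwaway directories.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "archive")
os.mkdir(archive)
segment = os.path.join(workdir, "00000001000000030000008E")
with open(segment, "wb") as handle:
    handle.write(b"not a real WAL segment")

print(archive_segment(segment, archive, "00000001000000030000008E"))
print(archive_segment(segment, archive, "00000001000000030000008E"))
```

In postgresql.conf a wrapper around this function would be called with %p (path of the segment) and %f (its file name); for a remote archive the local copy would be replaced by scp or rsync.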

Recovering

To configure a server for recovery you need to place a file named "recovery.conf" into its $PGDATA directory. A sample recovery.conf might look something like this:

restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -l -t /var/lib/postgresql/logship.trigger /srv/logship-archive %f %p'
log_restartpoints = 'true'

# for PITR
#recovery_target_time = '2009-04-21 19:00:00'

Additionally, the server needs a consistent backup in its $PGDATA directory and access to all WAL files that have been written since the backup.

When started in recovery mode, PostgreSQL will replay WAL files until the program referenced in restore_command returns a non-zero exit status. After that it will take the database online and increment the timeline counter of the WAL files, which effectively prevents the current database from being used as a target for recovery again. This is necessary because modifications can happen to the tables as soon as the database is taken online.

pg_standby

pg_standby is a contrib tool that watches a given directory for new WAL files and makes these available to PostgreSQL by copying/linking the given files into its pg_xlog directory.

When using pg_standby there are two main mechanisms for ending replication:

"Pulling the trigger", meaning: creating the specified trigger file
Feeding an incomplete WAL-file: Imagine a crashed server that doesn't boot anymore: if you can salvage the active WAL segment and copy it to the recovery server, PostgreSQL will notice that the WAL segment is incomplete, perform its normal startup procedure and increment the timeline.
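The waiting behaviour and the trigger file can be modelled in a few lines. This is a simplified sketch of the idea and not how pg_standby is actually implemented: the function and its parameters are invented, it doesn't check whether a segment has been copied completely, and max_polls only exists so that the demo terminates.

```python
import os
import shutil
import tempfile
import time


def wait_for_segment(archive_dir, file_name, destination, trigger_file,
                     poll_interval=1.0, max_polls=None):
    """Wait until a WAL segment shows up in archive_dir and copy it.

    Returns 0 once the segment has been copied to destination and 1 if
    the trigger file exists (or max_polls is exhausted), which tells the
    caller to stop recovery.
    """
    source = os.path.join(archive_dir, file_name)
    polls = 0
    while True:
        if os.path.exists(trigger_file):
            # "Pulling the trigger": stop waiting and end recovery.
            return 1
        if os.path.exists(source):
            shutil.copyfile(source, destination)
            return 0
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return 1
        time.sleep(poll_interval)


# Demo with throwaway files instead of a real archive.
workdir = tempfile.mkdtemp()
trigger = os.path.join(workdir, "logship.trigger")
with open(os.path.join(workdir, "00000001000000030000008E"), "wb") as handle:
    handle.write(b"segment data")

print(wait_for_segment(workdir, "00000001000000030000008E",
                       os.path.join(workdir, "restored"), trigger))
open(trigger, "w").close()
print(wait_for_segment(workdir, "00000001000000030000008F",
                       os.path.join(workdir, "restored"), trigger))
```

The first call finds the segment and hands it over; the second one sees the trigger file and gives up, which is the point where PostgreSQL would leave recovery mode.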

Resuming recovery

After a slave server has been taken online (and its timeline was switched) you must copy a backup from the master server and create a new recovery.conf to resume log shipping operation.

Doing it all

Now that we know what to do and how these things work, let's break a few things!

Preparing

Preparing the environments should be rather easy. First make sure that your machines are set up correctly.

When both machines are running, run the following command:

master1:~/pgworkshop# ./envorcer logship

This creates a cluster named "logship" on both servers, creates a database for pgexerciser on master1 and installs its schema into the database.

Additionally, it creates a directory on slave1 where the WAL files will be copied to, enables archive_mode among a few other settings on master1 and copies a base backup of the database & an appropriate recovery.conf to slave1.

Doing some transactions

Start the databases on both servers with pg_ctlcluster and run pgexerciser (no arguments needed) on master1.

archive_timeout is set to 60 seconds, so a logswitch should occur every minute. This can be monitored in a few places:

The "archiver" process on master1 and the "startup" process on slave1 will show in their process title which WAL file they have handled or are expecting next
PostgreSQL also keeps track of which files have already been copied on master1 in $PGDATA/pg_xlog/archive_status
The PostgreSQL logfile on slave1 (found in /var/log/postgresql/) will show when the WAL files have been processed
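The archive_status directory is the easiest of the three to check from a script: PostgreSQL marks segments that are waiting for the archiver with a ".ready" file and segments that have been archived with a ".done" file. A small sketch (the function name is made up, and the demo uses a temporary directory instead of a real $PGDATA):

```python
import os
import tempfile


def archive_status(status_dir):
    """Split the segments listed in archive_status into done and ready."""
    done, ready = [], []
    for entry in sorted(os.listdir(status_dir)):
        name, extension = os.path.splitext(entry)
        if extension == ".done":
            done.append(name)
        elif extension == ".ready":
            ready.append(name)
    return {"done": done, "ready": ready}


# Demo: fake an archive_status directory with one archived and one
# pending segment.
status_dir = tempfile.mkdtemp()
for marker in ("00000001000000030000008E.done",
               "00000001000000030000008F.ready"):
    open(os.path.join(status_dir, marker), "w").close()

print(archive_status(status_dir))
```

A steadily growing "ready" list is a good hint that archive_command is failing.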

Breaking stuff

Now it's up to you. You could either create the trigger file pg_standby watches, "killall -9 postgres" on the master and copy over the active WAL segment, or try a PITR (point-in-time recovery).

Resuming recovery, this time for real

After you have taken the slave online, use the following steps to get back into recovery mode:

slave1:~# killall -9 postgres
master1:~# psql postgres -c "select pg_start_backup('foo')"
master1:~# rsync -avH --delete --delete-excluded --exclude 'pg_xlog/*' /var/lib/postgresql/8.3/logship/ root@slave1:/var/lib/postgresql/8.3/logship
master1:~# psql postgres -c "select pg_stop_backup()"
master1:~# scp pgworkshop/configs/logship/recovery.conf root@slave1:/var/lib/postgresql/8.3/logship/

When you start the PostgreSQL cluster on slave1 again, it should come up in recovery mode. More on backing up PostgreSQL databases can be found in the documentation.

Testing PostgreSQL replication solutions: Basic Setup

2009-05-02T16:43:00Z

I want to provide an introduction, annotated examples and an easy-to-set-up test environment for a few common and "simple" PostgreSQL replication solutions.

I planned on providing images, but after what I've seen so far it seems to be much easier to just provide a HowTo ;).

Prerequisites

I chose Debian as test platform because I'm familiar with it and the PostgreSQL related packages are in excellent shape there.

What you need is

two separate Debian Lenny instances
with the following packages installed:
- postgresql postgresql-contrib postgresql-8.3-slony1 slony1-bin mercurial less libdbd-pg-perl libpoe-perl rsync psmisc ssh screen python-psycopg2 libstring-random-perl
which are reachable via the respective hostnames "master1" and "slave1"
where both the root and the postgres user from master1 can ssh into root and postgres on slave1

Using VirtualBox

If you're using VirtualBox you can use this as a rough draft:

Create master1 machine, 8GB dynamic disk, 256MB RAM, three NICs:
- Adapter 1: NAT
- Adapter 2: Internal Network "intnet"
- Adapter 3: Host-Only network (optional, only needed if you don't like using VBox's console)
Install Debian Lenny (look for "netinst"), set hostname to master1, don't select any profiles in the tasksel screen since it's not necessary

apt-get install postgresql postgresql-contrib postgresql-8.3-slony1 slony1-bin mercurial less libdbd-pg-perl libpoe-perl rsync psmisc ssh screen python-psycopg2 libstring-random-perl perl-doc
cd /root; hg clone https://workbench.amd.co.at/hg/pgworkshop/
ssh-keygen -q -t dsa -f ~/.ssh/id_dsa -N ""
cp /root/.ssh/id_dsa.pub /root/.ssh/authorized_keys
cp -a /root/.ssh /var/lib/postgresql
chown -Rv postgres:postgres /var/lib/postgresql/.ssh/
rm -v /etc/udev/rules.d/*-persistent-net*
echo "10.1.0.11       slave1" >> /etc/hosts

Then configure eth1:

cat << HERE >> /etc/network/interfaces

auto eth1
iface eth1 inet static
address 10.1.0.10
netmask 255.255.255.0
HERE

Stop the instance and snapshot it, for good measure
Add a second machine named slave1, identical configuration to master1, choose the same disk as master1. This will cause VirtualBox to use the state of master1 as snapshot source for slave1.
Boot slave1, change the IP of eth1 in /etc/network/interfaces to 10.1.0.11 and change the hostname in /etc/hostname to slave1
Boot master1, reboot slave1
ssh root@slave1, ssh postgres@slave1 from master1 should work now.

And you're done!

PostgreSQL on Debian

Debian offers a few tools to manage multiple Postgres "clusters" (as in "instance").

"ls -l /usr/bin/pg_*cluster" shows all available commands, we will use pg_ctlcluster regularly to start, stop, restart or reload clusters.

Custom tools

I've written two tools to make testing replication scenarios easier. These can be found in the mercurial repository at https://workbench.amd.co.at/hg/pgworkshop/. The tutorials assume that the repository has been checked out to "/root/pgworkshop".

The envorcer

There is a script called "envorcer", which is basically an "environment enforcer". It prepares the PostgreSQL databases & needed configuration for the test cases.

It is very destructive, so it has a hardcoded hostname check so that it can only be run from master1.

Running it without arguments shows a short usage example; the source code is pretty self-explanatory and fairly well commented ;).

The pgexerciser

The pgexerciser is in the same directory as the envorcer and is used for exercising a given PostgreSQL database. See ./pgexerciser --help for documentation.

PostgreSQL replicated: A workshop

2009-05-02T15:40:03Z

The workshop didn't go the way I had expected.

I was still somewhat worn out from the half-flu I had brought along, only three people attended in total¹, VirtualBox didn't work out of the box for anyone (Fuck this, I'm going back to VMware), and scoping is always extremely hard with a very small group whose experience levels vary widely.

So that the rest of the world gets something out of it too (and so that I can give the two of them something to test through), I will write up the hands-on part in a few articles and make them available here.

For now, the slides from the workshop can be found here

¹ And after an hour I was alone with Kristian, because one attendee had to catch his train and another, given the dysfunctional VirtualBox, preferred to catch one more talk.

The OSDC 2009 is over

2009-05-02T15:27:00Z

It was a nice conference, the guys and gals from Netways surely know how to run an event. It's all the nice little details which make up a great experience¹.

I was also surprised by NH Hoteles: the Nuremberg City one greeted us with one of the most attractive parking garages I've ever seen (_very_ clean, "follow me"-lines on the floor, automatic hinged safety doors, complimentary window cleaning for hotel guests, etc.) and the hotel lived up to the standards it set in its garage ;). The only problem I noticed was that the dining area was constantly understaffed for the 70-something people who attended the conference.

The lineup of the conference was quite nice although I prefer "war stories" told from real world scenarios over feature presentations of a single solution. Fortunately Kristian Köhntopp was able to speak about his experiences from his times as a MySQL consultant and the stuff he's doing over at booking.

Puppet

Luke Kanies (Reductive Labs) talked a bit about Puppet, which most of the attendees already knew. It's still the best configuration management solution for heterogeneous environments where the "foil ball" approach (his words!) of golden master images doesn't cut it anymore. Another part of his talk addressed how the Puppet development approach and community integration are way better than what he experienced with the author(s) of cfengine back then, which eventually caused him to start his own thing. Puppet shows progress in critical areas (dropping XML-RPC in favor of REST to increase performance, especially when serving static files) but still has a long way to go. One of the issues Kristian mentioned is that Facter only supports scalar values natively and no complex data structures. This is very limiting when you need to analyze complex data structures, e.g. the LVM configuration of a server.

DRBD, the stuff that was formerly known as Heartbeat & KVM

Florian Haas (Linbit, the company behind DRBD) showed how Virtualization & HA play together with the building blocks being KVM, Pacemaker, OpenAIS and DRBD. He talked a bit about the infighting in the Linux-HA/Heartbeat community, which eventually led to the current Pacemaker & OpenAIS solution (which is not yet available in stock Debian systems). One of the issues full and paravirtualization techniques have compared to container-based solutions like OpenVZ and Solaris Zones is performance. He presented a few slides from his talk at the Percona Performance Conference 2009, showing latency issues in KVM, which are very bad in systems with large amounts of unbatched transactions. Since his results were only a week old it's too early to comment on the reasons and resolutions; the bottom line was that it might be too early to bury Xen until these things are resolved.

Over a talk with Florian I was finally able to stop worrying and love shared-nothing architectures. Florian told me that my association of DRBD with "something to keep services on shoddy hardware online" wasn't too far-fetched, since the first version of DRBD was written out of the need to run complex computation jobs on rickety machines in a CS lab without losing the complete calculation if one of the nodes bit the dust in A Bad Way. But DRBD has evolved considerably since then, and with the overwhelmingly positive feedback of other conference attendees and DRBD's availability in stock Debian and RedHat distributions I'm finally convinced that it's A Good Thing ;).

Incubation completed in 3... 2... 1...

The other talks of the day that I attended weren't that interesting and the latent flu I brought with me from last week finally started to kick in, causing me to call it a day at 19:00 and sweat through the night.

Systematic management of 1000 heterogeneous nodes

On the second day Kristian Köhntopp (Booking.com) started the day with a talk about how they do systems (and database) management at their shop (in a hurry: HP hardware for easy deployment and MAC address management, PXE and atftpd (with custom database backend) in combination with Kickstart for basic setup, puppet and yum for everything else). The basic system they install, which is identical for every server, is a minimum CentOS installation with a Puppet client; all customization is done afterwards with puppet. Kris also told some stories about outstanding Puppet issues (Facter/Puppet only handling scalar values, random Facter state corruptions, horrible fileserving with Puppet 0.24, etc.) but it is still the best tool for the job and way more flexible than cfengine, which they used previously.

Why has MySQL still got a market share in professional environments?

I had an interesting talk with Kris over a glass of peach juice (NH is sooo exclusive!) about why MySQL's oddities don't hurt that much in BASE-environments and why a simple and (somewhat) flexible replication solution is of utmost importance in such scenarios. Expect more on that topic in this blog in the future. I won't say that I'm convinced that MySQL is the best solution in those environments, but at least now I understand that it's a viable choice ;).

Postfix ate my Spam!

Charly Kühnast (RZ Niederrhein) then presented his Postfix-based spam filtering solution. I forgot the exact numbers, but it looked very promising. The basic components were (off the top of my head; I think he talked about 6 tiers but I was only able to remember 5...)

Policyd with RBLs
Header checks (HELO, sender/recipient verification, etc.)
SpamAssassin
FuzzyOCR
ClamAV with custom definition files

which were quite effective in combination and very low in maintenance requirements, in his own words.

Wrapping it up

After that it was time for my workshop (more on that later) followed by discussions and a final beer with the guys from Netways and the few attendees who were still around.

I hit the road with Mika at 18:15 and we were back in Vienna 5 hours later...

Graphing heterogeneous data sets with multiple axes

2009-03-15T21:44:47Z

A while ago I wrote a small script which runs benchmarks against given filesystems and collects performance data for each run. What I wanted to find out was how expensive (IO-wise) various standard filesystem operations are.

The collected information proved to be quite extensive and very hard to visualize.

But why?!

I always wondered how much faster a given filesystem is for specific tasks and more importantly - why?

Some dogmas which exist are:

ext2 is fast for sequential I/O
reiserfs is fast for handling many small files
xfs is fast for deletes
everything except the ext* family of filesystems will eat your data for breakfast at the slightest chance of blockdevice or kernel issues

but will those live up to scrutiny?

The benchmarks

What I did was to define some basic isolated filesystem workloads which are supposed to benchmark different areas of a given filesystem. What I came up with was:

write a 4GB file with cp
read a 4GB file with cp
delete a 4GB file
create many files (untar 2.6.[0,5,10,15,20,25] linux sources)
read many files (rsync given files to an empty directory)
stat many files (rsync given files to the previously filled directory)
delete many files (delete given files)

The tested filesystem was unmounted between each run to simulate cold caches. I collected the extended io statistics from /proc/diskstats, the interesting bits being the amount of IOs and the sectors read/written during the run as well as the total duration.
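Collecting these numbers boils down to reading /proc/diskstats before and after a run and subtracting. The sketch below is not my original benchmark script; the function names are invented for this example and the two sample lines contain made-up numbers. The field order (reads completed, reads merged, sectors read, milliseconds reading, followed by the same four for writes) is the one documented for the Linux kernel.

```python
FIELDS = ("reads", "reads_merged", "sectors_read", "ms_reading",
          "writes", "writes_merged", "sectors_written", "ms_writing")


def parse_diskstats(text, device):
    """Return the first eight I/O counters for one device as a dict."""
    for line in text.splitlines():
        parts = line.split()
        # parts[0:2] are the major/minor numbers, parts[2] the device name.
        if len(parts) >= 11 and parts[2] == device:
            return dict(zip(FIELDS, map(int, parts[3:11])))
    raise KeyError(device)


def diskstats_delta(before, after):
    """Counters accumulated between two snapshots of the same device."""
    return {key: after[key] - before[key] for key in FIELDS}


# Made-up snapshots; on a real system use open("/proc/diskstats").read()
# before and after the benchmark run instead.
before = parse_diskstats("   8       0 sda 100 5 2000 300 50 2 800 120 0 350 420", "sda")
after = parse_diskstats("   8       0 sda 160 5 3200 450 90 4 1800 260 0 600 710", "sda")
print(diskstats_delta(before, after))
```

Sectors in /proc/diskstats are counted in units of 512 bytes, so multiply the sector deltas by 512 to get the amount of data read or written.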

The system I used for testing is an Athlon64 X2, running Debian Lenny with a stock kernel. The filesystems were created on a very dated Seagate Barracuda 7200.7.

Although these tests are highly unscientific¹ they already yielded some very interesting, and much more importantly, reproducible results.

The data

A sample result set from one benchmark run can be found here. If you've got a high pain threshold (and/or a soft spot for raw numbers) you can already deduce some interesting facts from this list, e.g. that ext4 is much faster than ext3 for most operations, or that xfs is embarrassingly slow when creating many files. But to get a big picture of what's going on here you need to visualize the data.

I did a bit of sketching and came up with something like this:

I wanted to stack identical units (e.g. read & write IOs or read & written sectors) to form a single bar and preserve space this way. Additionally I wanted to group the bars together to make comparison easier and improve the overall graph layout. To make things even more complicated I wanted to combine three different units (IOs, bytes and seconds) on a single graph.

After a bit of reading I found out that the result is supposedly called a "grouped, stacked bar graph with variable y axes". That was my goal.

Tools of the trade

Having hardly any experience with data visualization I turned to gnuplot and got disappointed: only up to two axes per graph and dimension, sparse documentation for the things I wanted to achieve and a mailing list which never accepted my "anonymous" Gmane post, which got stuck in the moderation queue.

The various Flash rendering frameworks like Google Chart seemed promising but didn't live up to my rather specific expectations.

Then a friend of mine pointed me to SigmaPlot, which he had used for his diploma thesis and spoke highly of.

I gave it a try, and after a bit of trial and error (and dropping the stacked bar requirement) I had my first graphs. Implementing multiple axes isn't too easy with SigmaPlot either (and seems very "bolted on" rather than nicely integrated), but at least I had my first visualized data sets.

The results

This graph was the first I did and was grouped by filesystem because this was much easier to accomplish.

The second graph resembles the one I drafted in the beginning, minus the stacked bar graphs (which is a pity, since there's interesting information lost²).

So what do these graphs tell us? For the given test case (create a few hundred thousand files):

ext4's performance is almost identical to ext2's, which is great to hear
The number of sectors that need to be read and written is closely grouped across filesystems, except for xfs (with reiserfs setting the lower bound)
For the ext* family and reiserfs, the number of IOs correlates with the overall runtime
Both xfs and jfs seem unsuitable for general usage, at least with standard mkfs and mount parameters on Debian Lenny.

And now?

To be honest, I'm not too fond of the results I got. The amount of time necessary to get the graphs in question seems prohibitively high. Also, the results will never satisfy all people since they're rather static and may contain too much "noise" or not the right combination of data points for a given question you want to answer.³

If you've got any suggestions on different tools or approaches, they would be highly appreciated.

And if I don't get any new input, I'll eventually re-run the benchmarks with a more recent kernel (adding a stable ext4 and btrfs to the mix), check whether the jfs and xfs results are representative and, last but not least, average a few iterations and increase the working set to get solid results.

And always remember:

toothpastefordinner.com

Scripts, etc.

In case you want to run your own benchmarks, you can find the highly undocumented and uncommented scripts here. Basic instructions for creating the graphs with SigmaPlot can be found here.

Footnotes

¹ What's the buffer size which cp uses for copying? Am I stalled by reading/writing from/to the "helper" filesystems? Are the collected numbers representative for "normal" usage with warm caches?

² E.g. "How many read IOs does a filesystem need to do to delete a single large file?"

³ Interestingly these are similar issues which you will also have when comparing tools like Munin and Zabbix. The former is rather easy to set up but will bite you when you try the simplest form of data correlation, especially for older data sets. The latter is a huge PITA to set up but offers very sophisticated and dynamic tools for data analysis and correlation.