Disaster Strikes

I ought to know better; there are loads of sayings that I should take heed of.

“Don’t fix what ain’t broke” is probably one of them, followed very closely by “If it can go wrong, it will go wrong”.

It all started with my daughter, who recently received some money for her 21st birthday and decided to build herself a new computer with some TV tuners, so she could watch TV whilst she was working at university.

I decided to try and improve the safety of the data on my hard drives. I was seduced by the thought that I could buy a 160GB hard drive for about £40, and by the long-term plan to add TV tuners to the server to provide the backbone of a TV server for the house. That would, of course, need storage for all the video files we would collect.

But the little voice in my head would try and seduce me even more. “Hard drives are so cheap now that you could store your data archives on them rather than on CD-ROMs.” So the plan was hatched: buy two SATA hard drives and a PCI SATA controller and add them to the server as a new storage capability. Use some space for video, but set up the remainder as RAID and make it secure storage for archive data. Order placed, parts delivered.

First Problem: I installed the PCI card in the server, and whilst it would recognize the drives, if any drives were connected to it the machine would not get past the BIOS checks and would fail to boot. Calls to the hard drive manufacturer were no help – their advice was to buy a different card. Just before I did, I tried the card in my workstation PC, but no joy there either – the BIOS (at the latest release) would not boot up.

So, since the first SATA PCI card cost only a few pounds, I decided to take the plunge and buy another one. Bringing it back home, I tried it in the server – still no joy, it could not get past boot. Tried it in the workstation and it worked great, no problems.

So back to the drawing board for a rethink. If I took my four largest IDE drives from both machines (200, 80, 60 and 40GB) and put them in the server, I could have a complicated arrangement of RAID1, RAID5 and no RAID, with a liberal sprinkling of LVM, to create sufficient space to handle video, image and audio files for all the family, while the server continued in its existing role as web, mail and internet gateway and as a network server for the home LAN. I could put the two 160GB SATA drives into a RAID1 mirror and use them for the workstation. The archive would have to stay on the workstation (where it had been in the old arrangement, until it was written to CD-ROM after sitting there for 6 months).

In addition, I worked out a regime of backups, so that all key data stores on any one machine had a recent backup copy on the other, and the most important data store (my online database) would also have versioned backups stored in the archive.
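
A regime like that needs nothing more elaborate than a nightly rsync of each key data store onto the other machine. A minimal sketch, with hostnames and paths invented for the example rather than taken from my real layout:

    # Run from cron in the small hours on the server; the workstation runs
    # the mirror-image job for the server's key data stores.
    rsync -a --delete workstation:/home/shared/documents/ /backup/workstation/documents/
    rsync -a --delete workstation:/var/backups/database/  /backup/workstation/database/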

Second Problem: Goodness knows why, but this seemed a great time to move my server, which had been Sarge-based, over to Etch. Why? Well, I had this idea to move to Tomcat5 for my backend application server, and that was not supported on the old release. But I should have known better. I chose a time when there was a bug in libdevmapper which somehow prevented me from accessing LVM volumes, so in the midst of all the potential turmoil of moving disks around, I suddenly couldn’t boot the system. I think it probably took a whole weekend to sort that one out.

Third Problem: All this re-arrangement required a complicated movement of data between machines, as the IDE drives were all to be moved over to the server, re-ordered as to which IDE channel they were on, and their partition sizes changed. Still, I had mostly used LVM for my filesystems, so it should have been easy to clear down spare partitions, turn them into new PV devices with pvcreate, add them to a volume group with vgextend, and use the pvmove command to move the data onto this spare space. Finally, the old location could be removed from the volume group with vgreduce, and the old partition released with pvremove. This should be the ideal way to clear down a drive and move it to somewhere else.
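
Put together, the sequence I am describing looks roughly like this (the device and volume group names are examples only, not my actual layout):

    # Turn a cleared-down partition into a new physical volume and add it
    # to the existing volume group
    pvcreate /dev/hdc1
    vgextend vg0 /dev/hdc1

    # Move every allocated extent off the old partition onto free space
    # elsewhere in the volume group (the slow step)
    pvmove /dev/hda3

    # Once it is empty, drop the old partition from the group and release it
    vgreduce vg0 /dev/hda3
    pvremove /dev/hda3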

Wrong. The problem started when I was in the middle of a pvmove, went to another (pseudo) terminal and attempted to start another pvmove. Unfortunately, at this point I started to learn the hard way that pvmove is a very fragile operation, and any LVM activity in parallel has a tendency to lock things solid, leaving corrupted metadata and sometimes (always at a crucial point in my operations) lost data.
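
With hindsight, the safe way to deal with an interrupted pvmove is to leave the rest of LVM well alone and let it carry on from its checkpoint, or abandon the move entirely, rather than starting anything else in parallel:

    # Restart any interrupted pvmove from its last checkpoint...
    pvmove

    # ...or abandon the half-finished move and put the extents back where
    # they came from
    pvmove --abort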

Fortunately, I have always been very careful with backups. I have had a regime of always having another backup copy, less than one day old, on another hard disk, so that in the event of a problem I could recover. This was one of the times that such a backup was invaluable. But another few valuable hours were wasted recovering the data, and I was now starting to feel vulnerable, because as the filesystems got more and more tightly packed while I squeezed things into small partitions, I had to switch off backups to avoid them filling up the space I was trying to free. Still, it would all be finished in the next day or so. Or would it?

Fourth Problem: I had almost rebuilt most of my systems under RAID, moving data between machines, partitions, RAID arrays and LVM physical volumes.

I had organised my workstation to have 4 partitions on each of the disks, which, apart from the swap space, were all formed into 3 RAID1 arrays. Root and /boot were mounted directly on RAID devices; the last RAID device was a huge LVM volume group from which I could allocate extra space as needed.
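
In outline (device names and sizes here are illustrative, not my exact partition tables), that sort of arrangement is built with mdadm and the LVM tools like so:

    # Mirror each partition pair across the two SATA drives
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1   # /boot
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2   # root
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4   # LVM

    # The big mirror becomes a single volume group to be carved up as needed
    pvcreate /dev/md2
    vgcreate vg_ws /dev/md2
    lvcreate -L 40G -n home vg_ws
    mkfs.ext3 /dev/vg_ws/home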

The server was a little trickier, with different sized disks and the need to allocate 100GB for video storage. In the end I had about 40GB in pairs (RAID1 arrays) for root, /boot and some LVM volumes for filesystems holding key data; 120GB of RAID5 space built from 60GB on each of three disks; two 500MB RAID0 partitions for the log files; and about 110GB as a single LVM volume spread over the remaining small holes on all the disks. Because of its size, the 200GB disk was involved in lots of these and needed more than four partitions, so some had to be logical partitions. This was almost working, and needed a last couple of pushes to get there. And it was here that I made the crucial mistake. I was going to have to delete the backup of some data that I was about to move into the new partitions. Instead of moving it elsewhere to ensure I still had a backup copy, I just deleted the backup and started to move the real data into the space it had just occupied.
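
The RAID5 part, to pick one example, is just a 60GB partition from each of three disks combined into one array and then handed to LVM (device and group names again invented for illustration):

    # Three 60GB partitions, one per disk, give about 120GB of usable space
    # that survives the loss of any one disk
    mdadm --create /dev/md5 --level=5 --raid-devices=3 /dev/hda5 /dev/hdb5 /dev/hdc5

    # ...which then becomes another physical volume for the video space
    pvcreate /dev/md5
    vgcreate vg_media /dev/md5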

Guess what: no sooner had I moved the data than the disk complained that there was a problem with the partitions (in fact only one) in the logical partition area, and try as I might, there was no recovery and no backup.

This just happened to be the area that held

  • Web site, both the static and dynamic content
  • The photo gallery for the web site
  • The database behind the web site
  • The git repositories that are presented to the public

Fortunately, I have managed to recover most of this data from source. The web site and dynamic content applications were available from my development environment on the workstation, and a month-old copy of the database was also available there for testing purposes.

I do have the pictures to put back in the gallery (although I have lost the comments that I put against each picture), but I have to recreate the theme to match it into the rest of the web site.

The git repositories can just be recovered from my workstation repositories, except for some minor issues that will take a little time to resolve. These are:

  • Specialist hooks that created tarballs of releases for the download area (see the sketch after this list)
  • Gitweb configuration to make it work within my web site environment
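
The tarball hook, for what it is worth, need be nothing more than a post-receive script along these lines (the project name and download path are invented for the example):

    #!/bin/sh
    # post-receive hook: whenever a tag is pushed, drop a release tarball
    # of that tag into the web site's download area
    while read oldrev newrev refname; do
        case "$refname" in
            refs/tags/*)
                tag=${refname#refs/tags/}
                git archive --format=tar --prefix="project-$tag/" "$tag" \
                    | gzip > "/var/www/downloads/project-$tag.tar.gz"
                ;;
        esac
    done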

So I am almost up and running again, and will get the rest up in a short while. I am currently engaged in another project (see articles to follow shortly) that is distracting me.

But what I do have now is a much stronger configuration than before. All key data is stored on mirrored disks, even the less important data is stored on RAID5 (which only requires 2 out of 3 disks to be working), and all the important data stores are backed up on the other machine.

But if I had known how hard this simple activity would be, and how long it would actually take, I would never have started it!