RFC - bootscript error reporting

Bill's LFS Login lfsbill at nospam.dot
Fri Jan 30 10:00:35 PST 2004


On Thu, 29 Jan 2004, IvanK. wrote:

> Sorry for the top-post.
>
> All of the suggested ideas are good.
>
> Now, I'll throw in a new one (didn't say it's a good one): an initrd with
> enough utilities to (try to) "auto-correct" a broken lfs?
>
> The scenario is as follows:
> The system is not shut down properly; it reboots and tries to fsck itself, but
> fails.  It writes a /fatal (or something like that) file, and reboots itself,
> after running lilo -R "linux initrd=/boot/<initrd.image>"
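
For illustration, a minimal sketch of the one-shot reboot with LILO
(assuming a "rescue" stanza in lilo.conf that names the initrd image,
since LILO loads the initrd from its config rather than from the kernel
command line):

  # Queue "rescue" as the image for the next reboot only; subsequent
  # reboots fall back to the default image automatically.
  lilo -R rescue
  reboot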

The better solution is to boot into a recovery procedure as the default.
If it sees a problem, it runs the recovery process and then either
reboots or just does the pivot_root things and comes up to production
status.

If it sees no problem, it just does the pivot_root stuff and transitions
to production state.

The time lost by always booting through the initrd will be *very*
small when there is no problem, and the savings will be large when there
is a problem.
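
A minimal /linuxrc sketch of that flow (hypothetical device name and
recovery helper; a real script would discover its devices):

  #!/bin/sh
  # Always boot through recovery, then hand off to the real system.
  ROOT=/dev/hda1                    # assumed root partition
  fsck -a "$ROOT"
  if [ $? -gt 1 ]; then             # >1 means errors were left unfixed
      do_recovery                   # placeholder for the repair steps
  fi
  mount "$ROOT" /new-root           # /new-root must exist in the initrd
  cd /new-root
  pivot_root . initrd               # assumes /new-root/initrd exists
  exec chroot . /sbin/init <dev/console >dev/console 2>&1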

> (Yeah, I'm still using lilo!  This can be adapted for grub, can't it?).  Only
> problem right now is that if you have a lilo password (shame on you if you
> don't :-) ), you will be prompted for a password.  This defeats the whole
> purpose of the exercise.  I wonder if I pass 'bypass' to lilo -R if it'll
> bypass the password prompt...  gotta try that.
>
> After the reboot, the initrd does the fsck thingie.  For example something
> like this:
>
> for disk in /dev/[hs]d[a-h]; do
>   for partition in `fdisk -l $disk | grep "Linux$" | awk '{print $1}'`; do
>     echo -ne "Checking $partition... " && echo "fsck -a -C -T $partition"
>   done
> done
>
> (get rid of the second echo to run the fsck.  This is only to verify the
> command)
>
> Extending this idea, /fatal could be a directory with files in it with enough
> information for the rescue mini-system to correct the problems.

Ideally, you want this FS to be unknown to the partition tables: just
make it some unused space occupying known extents on the HD. Attach a
loopback device to that extent (possible by using the whole-drive
specification and the offset parameter of the losetup utility), fsck it,
mount it, and you are ready to proceed (assuming the fsck was OK).
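
A sketch of that, with made-up numbers (the offset, loop device and
mount point are illustrative only):

  # Attach a loop device to a hidden extent on the raw disk. The offset
  # (an arbitrary 100 MB here) would be a well-known constant chosen at
  # install time; note the loop device runs to end-of-disk, so the FS
  # inside must be sized to fit the extent.
  losetup -o $((100 * 1024 * 1024)) /dev/loop0 /dev/hda
  fsck -a /dev/loop0 && mount /dev/loop0 /mnt/rescue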

>
> If we want to be clever, we can even check if all modules in /etc/modules.conf
> are present in /lib/modules/`uname -r` (especially the driver for eth0), etc.
> A whole world of possibilities?
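
A rough sketch of that check (hypothetical; it assumes the interesting
modules appear as alias lines in /etc/modules.conf and that module
object files are named after their targets):

  KVER=`uname -r`
  for mod in `awk '/^alias/ {print $3}' /etc/modules.conf`; do
      find /lib/modules/$KVER -name "$mod.*" | grep -q . \
          || echo "missing module: $mod"
  done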
>
> Now as far as logging goes, I'm of the opinion we should be capturing as much
> as possible of every warning/corrected problem/uncorrected problem to, say,
> /var/log/init.log.  But this is tricky because if /var could not be mounted
> rw, where do we write the log? Perhaps write to the ram disk, and if it can
> correct /var (or /), mount it rw and dump the log into place?

A worse possibility is that the FS is corrupted and you start writing to
it. I would use the swap area instead as storage (it has only one block
that is anything like a "format" and is easily recovered after we finish
writing over it - or you could just not step on that block). If you can
compress output as it is written, a *very* large number of messages can
be stored in a typical swap size.
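
A sketch of the swap trick (hypothetical device and log producer; the
first 4 KB page holding the swap signature is simply skipped):

  SWAP=/dev/hda2                    # assumed swap partition
  # Stream the boot log, compressed, into swap past the signature page.
  run_boot_checks 2>&1 | gzip -c | dd of=$SWAP bs=4096 seek=1
  # Later, once /var is writable, recover it (gzip decompresses the
  # stream and merely warns about the trailing swap garbage):
  dd if=$SWAP bs=4096 skip=1 | gzip -dc > /var/log/init.log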

>
> This could be the ravings of a lunatic, but it is *possible*

It is the ravings of a lunatic by the very fact that you even consider
such things. Not optional.

But quite possible. This is fairly similar to work I did for IBM on my
last contract. The major difference is we had boot failover timers and
supporting hardware monitors available with custom BIOS to support what
we wanted.

But rather than using the HD area for the stuff you mention, we tried to
reboot from CD-ROM, and if that failed we rebooted from floppy. There was
some info stored on the HD *outside of any partition* that would be used
to restore some node-specific info if the HD was working and only the FS
or partition tables were corrupted. Recovery of partitions and file
systems was done through calculations, based on HD size, within
/linuxrc, which re-established the partitions, made the file systems,
and loaded data common to all nodes from CD (if available) or from
another node on the cluster network if the CD failed.
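
Illustratively, that kind of size-based re-creation might look like this
(an entirely hypothetical layout; the real calculations were
site-specific):

  DISK=/dev/hda                     # nodes were assumed identical
  CYLS=`sfdisk -g $DISK | awk '{print $2}'`   # total cylinders
  # Hypothetical fixed layout: root on all but the last 16 cylinders,
  # swap on the remainder.
  printf ',%s,L\n,,S\n' $(($CYLS - 16)) | sfdisk $DISK
  mke2fs ${DISK}1 && mkswap ${DISK}2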

Once the drive was operational, the node rejoined the cluster and became
a good community member again.

>
> And on another note, I retract my question regarding rhgb.  Even though I've
> made some "progress" in porting it to lfs, I realize it's better dealt with
> in a hint than in lfs-book, or even blfs-book.
>
> IvanK.
>
> On Thursday 29 January 2004 03:47 am, Jeremy Utley wrote:
> > On Wed, 2004-01-28 at 15:11, James Robertson wrote:
> > ><snip bunch of good ideas suggested earlier in the thread>

-- 
NOTE: I'm on a new ISP, if I'm in your address book ...
Bill Maltby
lfsbillATearthlinkDOTnet
Fix line above & use it to mail me direct.


