OCFS2 Support Guide Linux/Solaris

Posted By Sagar Patil

This support guide is a supplement to the OCFS2 User’s Guide and the OCFS2 FAQ. The information provided is directed towards support personnel. End users should consult the User’s Guide and/or FAQ for information on setting up and using OCFS2.

# rpm -qa | grep ocfs2
ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1
ocfs2console-1.2.1-1
ocfs2-tools-1.2.1-1

# uname -a
Linux node01 2.6.9-22.0.1.ELsmp #1 SMP Fri Dec 2 15:52:15 PST 2005 i686 i686 i386 GNU/Linux
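
A quick way to cross-check the package against the running kernel is to embed `uname -r` in the query (a small sketch; it assumes the package naming shown above):

# rpm -qa | grep "ocfs2-`uname -r`"
ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1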

ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1 is the kernel module rpm. Ensure it matches the kernel version; if it does, the correct packages have been installed. To check whether the modules are being located, we first need to ensure the files exist. To list the contents of the package, do:

# rpm -ql ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/configfs
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/configfs/configfs.ko
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/debugfs
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/debugfs/debugfs.ko
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/ocfs2
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/ocfs2/ocfs2.ko
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/ocfs2/ocfs2_dlm.ko
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/ocfs2/ocfs2_dlmfs.ko
/lib/modules/2.6.9-22.0.1.ELsmp/kernel/fs/ocfs2/ocfs2_nodemanager.ko

Use find, ls, or any other tool to ensure all the files exist. Also check whether the same module(s) exist in a different location under the /lib/modules/`uname -r` directory (see the find sketch after the depmod step below); modprobe will pick up the first matching module it finds, and it may not be the one we want it to load. If you find a matching module in another directory, find the package that module was bundled with.
# rpm -qf /path/to/rogue/module

If it lists a package name, investigate how a matching module could be shipped in another package; "rpm --erase <package>" may be in order. On the other hand, if that comes up empty, then one has to assume it was hand copied. Either move it or delete it. If the files are ok, the next step is to rebuild the module dependencies.
# depmod -a
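
As mentioned above, one way to sweep the module tree for stray copies is a find over the current kernel’s module directory (a sketch; the filename patterns are assumptions):

# find /lib/modules/`uname -r` -name "ocfs2*.ko" -o -name "configfs.ko"

Any path outside the locations listed by rpm -ql above deserves a closer look.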

Try to load the modules again. If it still fails, attempt to do the various loads manually.
# modprobe configfs
# mount -t configfs configfs /config
# modprobe ocfs2_nodemanager
# modprobe ocfs2_dlm
# modprobe ocfs2_dlmfs
# mount -t ocfs2_dlmfs ocfs2_dlmfs /dlm

It could just be that /sbin or /bin is not in the PATH (modprobe lives in /sbin, mount in /bin). As always, keep checking dmesg for any errors. If the error is originating in kernel space, a more descriptive message will be listed in dmesg.
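
Once the modules load cleanly, a quick confirmation that they are resident, plus a look at the tail of the kernel log (the egrep pattern is the same one used later in this guide):

# lsmod | egrep "ocfs|config"
# dmesg | tail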

Starting the Cluster
OCFS2 bundles a cluster stack referred to as O2CB.
– One needs to start O2CB before one can mount OCFS2 volumes.
– /etc/init.d/o2cb is the init script which is used to start and stop the cluster.
– However, before the cluster can be started, /etc/ocfs2/cluster.conf needs to be populated with the list of all nodes in the cluster.
– Also, the default cluster name needs to be specified in /etc/sysconfig/o2cb. This can be done by hand or by running “/etc/init.d/o2cb configure”.
– Refer to the user’s guide for the steps to populate cluster.conf; a sample file is shown after this list.
– The common errors include specifying incorrect IP addresses and node names. While a public IP can be used, it is not recommended. One should use private IPs for security and for lower latency.
– The node name must match the hostname of the node that the entry is for. Once populated, one should ensure that the file is the same on all nodes in the cluster. An easy check is comparing the md5 checksums of the file on each node:
# md5sum /etc/ocfs2/cluster.conf
The checksum should be the same on all the nodes.
– Even though O2CB only expects the config file to be semantically the same and not necessarily have matching checksums, having the exact same file on all nodes makes cluster management easier.
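
For reference, here is a minimal cluster.conf for a hypothetical two-node cluster named ocfs2 (the node names, IP addresses and port are examples only; stanza headers start in the first column and parameter lines are indented with a tab):

node:
	ip_port = 7777
	ip_address = 192.168.1.101
	number = 0
	name = node01
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 192.168.1.102
	number = 1
	name = node02
	cluster = ocfs2

cluster:
	node_count = 2
	name = ocfs2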

To start the cluster, do:
# /etc/init.d/o2cb online

The o2cb init script calls o2cb_ctl to read cluster.conf and populate /config/cluster/<clustername>/node with the information. When o2cb_ctl detects a node name in the config file matching the hostname of the node, it writes 1 to /config/cluster/<clustername>/node/<nodename>/local. The node manager in the kernel, in turn, launches the [o2net] thread. The cluster is now online. Keep in mind the node manager’s view of the cluster members is only what is populated in /config/cluster/<clustername>/node. If one edits cluster.conf after the cluster is started, the new configuration will not take effect until the cluster is restarted.
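
A quick way to confirm the cluster came up is to check the init script status and the configfs entries it populated (placeholders as above); the local file for this node should contain 1:

# /etc/init.d/o2cb status
# ls /config/cluster/<clustername>/node/
# cat /config/cluster/<clustername>/node/<nodename>/local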

Stopping the Cluster
One cannot stop the cluster if there are any active heartbeat regions, which indicate mounted volume(s). In other words, one can stop a cluster only if there are no entries in /config/cluster/<clustername>/heartbeat.
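
To check for active heartbeat regions before attempting a stop, list the directory mentioned above; any region entries correspond to mounted volumes:

# ls /config/cluster/<clustername>/heartbeat/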

To stop the cluster, do:
# /etc/init.d/o2cb offline

The o2cb init script calls o2cb_ctl which reads cluster.conf and removes the corresponding entries in /config/cluster/<clustername>/node. It should be noted that o2cb_ctl only removes entries listed in cluster.conf. So, if one were to manually edit the config file while the cluster was online, o2cb may not be able to clean up all the entries in /config. The cluster may even indicate that it is stopped but some modules will not unload. To understand how to fix such an issue, let us unload the modules manually.
# lsmod | egrep "ocfs|config"
ocfs2_dlmfs 25864 1
ocfs2_dlm 210056 1 ocfs2_dlmfs
ocfs2_nodemanager 178384 103 ocfs2_dlmfs,ocfs2_dlm
configfs 26764 2 ocfs2_nodemanager
Notice the reference counts in the 3rd column. ocfs2_dlmfs has 1. That is because dlmfs is mounted at /dlm.
# mount | grep ocfs2_dlm
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
Unmount /dlm manually and the reference count drops to 0.
# umount /dlm/
# lsmod | egrep "ocfs|config"
ocfs2_dlmfs 25864 0
ocfs2_dlm 210056 1 ocfs2_dlmfs
ocfs2_nodemanager 178384 103 ocfs2_dlmfs,ocfs2_dlm
configfs 26764 2 ocfs2_nodemanager
One can unload inactive modules using rmmod.
# rmmod ocfs2_dlmfs
# rmmod ocfs2_dlm
# lsmod | egrep "ocfs|config"
ocfs2_nodemanager 178384 101
configfs 26764 2 ocfs2_nodemanager
The node manager reference counts are due to cluster information populated in /config.
# ls /config/cluster/<clustername>/node/ | wc -l
100
# ls /config/cluster/ | wc -l
1
In this case, 100 nodes and 1 cluster. To remove entries by hand, do:
# rmdir /config/cluster/<clustername>/node/*
# rmdir /config/cluster/*
# lsmod | egrep "ocfs|config"
ocfs2_nodemanager 178384 0
configfs 26764 2 ocfs2_nodemanager
# rmmod ocfs2_nodemanager
# lsmod | egrep "ocfs|config"
configfs 26764 1
# mount | grep configfs
configfs on /config type configfs (rw)
Like dlmfs, configfs also needs to be unmounted.
# umount /config
# rmmod configfs
# lsmod | egrep "ocfs|config"
If one were to edit cluster.conf while the cluster was up and then tried to shut down the cluster, it may fail. The success or failure depends on the change made. If the user only added new nodes, it would work. However, if the user deleted existing nodes, one would need to manually remove the entries from /config as listed above. In any of these cases, rebooting the node would also “fix” the problem.
