Monday, February 7, 2011

High Availability Storage

I would like to make 2 TB or so available via NFS and CIFS. I am looking for a 2 (or more) server solution for high availability and the ability to load balance across the servers if possible. Any suggestions for clustering or high availability solutions?

This is business use, planning on growing to 5-10 TB over the next few years. Our facility runs almost 24 hours a day, six days a week. We could tolerate 15-30 minutes of downtime, but we want to minimize data loss. I want to minimize 3 AM calls.

We are currently running one server with ZFS on Solaris, and we are looking at AVS for the HA part, but we have had minor issues with Solaris (the CIFS implementation doesn't work with Vista, etc.) that have held us up.

We have started looking at

  • DRBD over GFS (GFS for distributed lock capability)
  • Gluster (needs client pieces, no native CIFS support?)
  • Windows DFS (docs say it only replicates after a file closes?)

We are looking for a "black box" that serves up data.

We currently snapshot the data in ZFS and send the snapshot over the net to a remote datacenter for offsite backup.
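
A minimal sketch of that snapshot-and-ship step, with a hypothetical pool/dataset (tank/data) and a placeholder offsite host:

    # Hedged example only - dataset and host names are placeholders.
    zfs snapshot tank/data@today
    zfs send -i tank/data@yesterday tank/data@today | \
        ssh backup.example.com zfs receive tank/data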

Our original plan was to have a second machine and rsync every 10-15 minutes. The issue on a failure would be that ongoing production processes would lose 15 minutes of data and be left "in the middle". They would almost be easier to start from the beginning than to figure out where to pick up in the middle. That is what drove us to look at HA solutions.

  • Are you looking for an "enterprise" solution or a "home" solution? It is hard to tell from your question, because 2TB is very small for an enterprise and a little on the high end for a home user (especially two servers). Could you clarify the need so we can discuss tradeoffs?

  • I would recommend NAS (Network Attached Storage).

    HP has some nice ones you can choose from.

    http://h18006.www1.hp.com/storage/aiostorage.html

    as well as Clustered versions:

    http://h18006.www1.hp.com/storage/software/clusteredfs/index.html?jumpid=reg_R1002_USEN

    From Sev
  • These days 2TB fits in one machine, so you've got options, from simple to complex. These all presume Linux servers:

    • You can get poor-man's HA by setting up two machines and doing a periodic rsync from the main one to the backup (see the sketch just after this list).
    • You can use DRBD to mirror one from the other at the block level. This has the disadvantage of being somewhat difficult to expand in the future.
    • You can use OCFS2 to cluster the disks instead, for future expandability.
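
    A minimal sketch of that periodic-rsync option, run from cron on the primary (hostname and path are placeholders):

        # Illustrative only - push the data tree to the standby every 15 minutes.
        */15 * * * * rsync -a --delete /export/ standby:/export/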

    There are also plenty of commercial solutions, but 2TB is a bit small for most of them these days.

    You haven't mentioned your application yet, but if hot failover isn't necessary, and all you really want is something that will stand up to losing a disk or two, find a NAS that supports RAID-5, has at least 4 drives, and offers hot-swap, and you should be good to go.

    From pjz
  • There are two ways to go at this. The first is to just buy a SAN or a NAS from Dell or HP and throw money at the problem. Modern storage hardware makes all of this easy to do, saving your expertise for more core problems.

    If you want to roll your own, take a look at using Linux with DRBD.

    http://www.drbd.org/

    DRBD allows you to create networked block devices. Think RAID 1 across two servers instead of just two disks. DRBD deployments are usually done using Heartbeat for failover in case one system dies.

    I'm not sure about load balancing, but you might investigate and see if LVS can be used to load balance across your DRBD hosts:

    http://www.linuxvirtualserver.org/

    To conclude, let me just reiterate that you're probably going to save yourself a lot of time in the long run just forking out the money for a NAS.

    From bmdhacks
  • I assume from the body of your question that you're a business user? I purchased a 6TB RAID 5 unit from Silicon Mechanics, have it attached as a NAS, and my engineer installed NFS on our servers. Backups are performed via rsync to another large-capacity NAS.

    From
  • Have a look at Amazon Simple Storage Service (Amazon S3)

    http://www.amazon.com/S3-AWS-home-page-Money/b/ref=sc_fe_l_2?ie=UTF8&node=16427261&no=3435361&me=A36L942TSJ2AJA

    -- This may be of interest re. High Availability

    Dear AWS Customer:

    Many of you have asked us to let you know ahead of time about features and services that are currently under development so that you can better plan for how that functionality might integrate with your applications. To that end, we are excited to share some early details with you about a new offering we have under development here at AWS -- a content delivery service.

    This new service will provide you a high performance method of distributing content to end users, giving your customers low latency and high data transfer rates when they access your objects. The initial release will help developers and businesses who need to deliver popular, publicly readable content over HTTP connections. Our goal is to create a content delivery service that:

    • Lets developers and businesses get started easily - there are no minimum fees and no commitments. You will only pay for what you actually use.
    • Is simple and easy to use - a single, simple API call is all that is needed to get started delivering your content.
    • Works seamlessly with Amazon S3 - this gives you durable storage for the original, definitive versions of your files while making the content delivery service easier to use.
    • Has a global presence - we use a global network of edge locations on three continents to deliver your content from the most appropriate location.

    You'll start by storing the original version of your objects in Amazon S3, making sure they are publicly readable. Then, you'll make a simple API call to register your bucket with the new content delivery service. This API call will return a new domain name for you to include in your web pages or application. When clients request an object using this domain name, they will be automatically routed to the nearest edge location for high performance delivery of your content. It's that simple.

    We're currently working with a small group of private beta customers, and expect to have this service widely available before the end of the year. If you'd like to be notified when we launch, please let us know by clicking here.

    Sincerely,

    The Amazon Web Services Team

    Stu Thompson : S3 does not have particularly fantastic availability. It is great in many ways, but does not fit the "high availability" requirement the OP is asking for.
    From pro
  • Your best bet may be to work with experts who do this sort of thing for a living. These guys are actually in our office complex... I've had a chance to work with them on a similar project I led.

    http://www.deltasquare.com/About

    From ben
  • I've recently deployed hanfs using DRBD as the backend. In my situation, I'm running active/standby mode, but I've also tested it successfully using OCFS2 in primary/primary mode. Unfortunately there isn't much documentation out there on how best to achieve this, and most of what exists is barely useful. If you do go the DRBD route, I highly recommend joining the drbd mailing list and reading all of the documentation. Here's my ha/drbd setup and the script I wrote to handle failover:


    DRBD8 is required - this is provided by drbd8-utils and drbd8-source. Once these are installed (I believe they're provided by backports), you can use module-assistant to install it - m-a a-i drbd8. Either depmod -a or reboot at this point; if you go the depmod -a route, you'll also need to modprobe drbd.
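
    Pulled together, those install steps look roughly like this (a sketch; exact package availability depends on your Debian release):

        apt-get install module-assistant drbd8-utils drbd8-source
        m-a a-i drbd8                  # build and install the drbd8 kernel module
        depmod -a && modprobe drbd     # or simply reboot instead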

    You'll require a backing partition to use for drbd. Do not make this partition an LVM volume, and do not put LVM on the drbd device either, or you'll hit all sorts of problems.

    Hanfs1's /etc/drbd.conf:
    
    global {
            usage-count no;
    }
    common {
            protocol C;
            disk { on-io-error detach; }
    }
    resource export {
            syncer {
                    rate 125M;
            }
            on hanfs2 {
                    address         172.20.1.218:7789;
                    device          /dev/drbd1;
                    disk            /dev/sda3;
                    meta-disk       internal;
            }
            on hanfs1 {
                    address         172.20.1.219:7789;
                    device          /dev/drbd1;
                    disk            /dev/sda3;
                    meta-disk       internal;
           }
    }
    Hanfs2's /etc/drbd.conf:
    
    global {
            usage-count no;
    }
    common {
            protocol C;
            disk { on-io-error detach; }
    }
    resource export {
            syncer {
                    rate 125M;
            }
            on hanfs2 {
                    address         172.20.1.218:7789;
                    device          /dev/drbd1;
                    disk            /dev/sda3;
                    meta-disk       internal;
            }
            on hanfs1 {
                    address         172.20.1.219:7789;
                    device          /dev/drbd1;
                    disk            /dev/sda3;
                    meta-disk       internal;
           }
    }
    Once configured, we need to bring up drbd next.
    drbdadm create-md export
    drbdadm attach export
    drbdadm connect export
    

    We must now perform an initial synchronization of the data. If this is a brand new drbd cluster, it doesn't matter which node you choose as the sync source.
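
    On DRBD 8, a typical way to do that is to promote one node and tell it to overwrite its peer - a sketch using the resource name export from the config above:

        # Run on the node whose data should be treated as authoritative.
        drbdadm -- --overwrite-data-of-peer primary export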

    Once done, you'll need to mkfs.yourchoiceoffilesystem on your drbd device - the device in our config above is /dev/drbd1. http://www.drbd.org/users-guide/p-work.html is a useful document to read while working with drbd.
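
    For example, with ext3 as an arbitrary choice of filesystem (fine for active/standby; primary/primary would need something like OCFS2):

        mkfs.ext3 /dev/drbd1      # run on the current primary only
        mkdir -p /export          # mount point used below - create it on both nodes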

    Heartbeat

    Install heartbeat2. (Pretty simple, apt-get install heartbeat2).

    /etc/ha.d/ha.cf on each machine should consist of:

    hanfs1:

    
    logfacility local0
    keepalive 2
    warntime 10
    deadtime 30
    initdead 120
    
    ucast eth1 172.20.1.218
    
    auto_failback no
    
    node hanfs1
    node hanfs2
    
    hanfs2:
    
    logfacility local0
    keepalive 2
    warntime 10
    deadtime 30
    initdead 120
    
    ucast eth1 172.20.1.219
    
    auto_failback no
    
    node hanfs1
    node hanfs2
    
    /etc/ha.d/haresources should be the same on both ha boxes:
    hanfs1 IPaddr::172.20.1.230/24/eth1
    hanfs1  HeartBeatWrapper

    I wrote a wrapper script to deal with the idiosyncrasies caused by nfs and drbd in a failover scenario. This script should exist within /etc/ha.d/resource.d/ on each machine.

    
    #!/bin/bash                                         
    
    #heartbeat fails hard.
    #so this is a wrapper    
    #to get around that stupidity
    #I'm just wrapping the heartbeat scripts, except for in the case of umount
    #as they work, mostly                                                     
    
    if [[ -e /tmp/heartbeatwrapper ]]; then
        runningpid=$(cat /tmp/heartbeatwrapper)
        if [[ -z $(ps --no-heading -p $runningpid) ]]; then
            echo "PID found, but process seems dead.  Continuing."
        else                                                      
            echo "PID found, process is alive, exiting."          
            exit 7                                                
        fi                                                        
    fi                                                            
    
    echo $$ > /tmp/heartbeatwrapper
    
    if [[ x$1 == "xstop" ]]; then
    
    /etc/init.d/nfs-kernel-server stop #>/dev/null 2>&1
    
    #NFS init script isn't LSB compatible, exit codes are 0 no matter what happens.
    #Thanks guys, you really make my day with this bullshit.                       
    #Because of the above, we just have to hope that nfs actually catches the signal
    #to exit, and manages to shut down its connections.                             
    #If it doesn't, we'll kill it later, then term any other nfs stuff afterwards.  
    #I found this to be an interesting insight into just how badly NFS is written.  
    
    sleep 1
    
        #we don't want to shutdown nfs first!
        #The lock files might go away, which would be bad.
    
        #The above seems to not matter much, the only thing I've determined
        #is that if you have anything mounted synchronously, it's going to break
        #no matter what I do.  Basically, sync == screwed; in NFSv3 terms.      
        #End result of failing over while a client that's synchronous is that   
        #the client hangs waiting for its nfs server to come back - thing doesn't
        #even bother to time out, or attempt a reconnect.                        
        #async works as expected - it insta-reconnects as soon as a connection seems
        #to be unstable, and continues to write data.  In all tests, md5sums have   
        #remained the same with/without failover during transfer.                   
    
        #So, we first unmount /export - this prevents drbd from having a shit-fit
        #when we attempt to turn this node secondary.                            
    
        #That's a lie too, to some degree. LVM is entirely to blame for why DRBD
        #was refusing to unmount.  Don't get me wrong, having /export mounted doesn't
        #help either, but still.                                                     
        #fix a usecase where one or other are unmounted already, which causes us to terminate early.
    
        if [[ "$(grep -o /varlibnfs/rpc_pipefs /etc/mtab)" ]]; then                                 
            for ((test=1; test /dev/null 2>&1                                
                if [[ -z $(grep -o /varlibnfs/rpc_pipefs /etc/mtab) ]]; then                        
                    break                                                                           
                fi                                                                                  
                if [[ $? -ne 0 ]]; then                                                             
                    #try again, harder this time                                                    
                    umount -l /var/lib/nfs/rpc_pipefs  >/dev/null 2>&1                              
                    if [[ -z $(grep -o /varlibnfs/rpc_pipefs /etc/mtab) ]]; then                    
                        break                                                                       
                    fi                                                                              
                fi                                                                                  
            done                                                                                    
            if [[ $test -eq 10 ]]; then                                                             
                rm -f /tmp/heartbeatwrapper                                                         
                echo "Problem unmounting rpc_pipefs"                                                
                exit 1                                                                              
            fi                                                                                      
        fi                                                                                          
    
        if [[ "$(grep -o /dev/drbd1 /etc/mtab)" ]]; then                                            
            for ((test=1; test /dev/null 2>&1                                                     
                if [[ -z $(grep -o /dev/drbd1 /etc/mtab) ]]; then                                   
                    break                                                                           
                fi                                                                                  
                if [[ $? -ne 0 ]]; then                                                             
                    #try again, harder this time                                                    
                    umount -l /export  >/dev/null 2>&1                                              
                    if [[ -z $(grep -o /dev/drbd1 /etc/mtab) ]]; then                               
                        break                                                                       
                    fi                                                                              
                fi                                                                                  
            done                                                                                    
            if [[ $test -eq 10 ]]; then                                                             
                rm -f /tmp/heartbeatwrapper                                                         
                echo "Problem unmount /export"                                                      
                exit 1                                                                              
            fi                                                                                      
        fi                                                                                          
    
    
        #now, it's important that we shut down nfs. it can't write to /export anymore, so that's fine.
        #if we leave it running at this point, then drbd will screwup when trying to go to secondary.  
        #See contradictory comment above for why this doesn't matter anymore.  These comments are left in
        #entirely to remind me of the pain this caused me to resolve.  A bit like why churches have Jesus
        #nailed onto a cross instead of chilling in a hammock.                                           
    
        pidof nfsd | xargs kill -9 >/dev/null 2>&1
    
        sleep 1                                   
    
        if [[ -n $(ps aux | grep nfs | grep -v grep) ]]; then
            echo "nfs still running, trying to kill again"   
            pidof nfsd | xargs kill -9 >/dev/null 2>&1       
        fi                                                   
    
        sleep 1
    
        /etc/init.d/nfs-kernel-server stop #>/dev/null 2>&1
    
        sleep 1
    
        #next we need to tear down drbd - easy with the heartbeat scripts
        #it takes input as resourcename start|stop|status                
        #First, we'll check to see if it's stopped                       
    
        /etc/ha.d/resource.d/drbddisk export status >/dev/null 2>&1
        if [[ $? -eq 2 ]]; then                                    
            echo "resource is already stopped for some reason..."  
        else                                                       
            #loop, telling drbd to go secondary until /proc/drbd agrees or we give up
            for ((i=1; i <= 10; i++)); do
                /etc/ha.d/resource.d/drbddisk export stop >/dev/null 2>&1
                if [[ $(egrep -o "st:[A-Za-z/]*" /proc/drbd | cut -d: -f2) == "Secondary/Secondary" ]] || [[ $(egrep -o "st:[A-Za-z/]*" /proc/drbd | cut -d: -f2) == "Secondary/Unknown" ]]; then                                                                                                                             
                    echo "Successfully stopped DRBD"                                                                                                             
                    break                                                                                                                                        
                else                                                                                                                                             
                    echo "Failed to stop drbd for some reason"                                                                                                   
                    cat /proc/drbd                                                                                                                               
                    if [[ $i -eq 10 ]]; then                                                                                                                     
                            exit 50                                                                                                                              
                    fi                                                                                                                                           
                fi                                                                                                                                               
            done                                                                                                                                                 
        fi                                                                                                                                                       
    
        rm -f /tmp/heartbeatwrapper                                                                                                                              
        exit 0                                                                                                                                                   
    
    elif [[ x$1 == "xstart" ]]; then
    
        #start up drbd first
        /etc/ha.d/resource.d/drbddisk export start >/dev/null 2>&1
        if [[ $? -ne 0 ]]; then                                   
            echo "Something seems to have broken. Let's check possibilities..."
            testvar=$(egrep -o "st:[A-Za-z/]*" /proc/drbd | cut -d: -f2)       
            if [[ $testvar == "Primary/Unknown" ]] || [[ $testvar == "Primary/Secondary" ]]
            then                                                                           
                echo "All is fine, we are already the Primary for some reason"             
            elif [[ $testvar == "Secondary/Unknown" ]] || [[ $testvar == "Secondary/Secondary" ]]
            then                                                                                 
                echo "Trying to assume Primary again"                                            
                /etc/ha.d/resource.d/drbddisk export start >/dev/null 2>&1                       
                if [[ $? -ne 0 ]]; then                                                          
                    echo "I give up, something's seriously broken here, and I can't help you to fix it."
                    rm -f /tmp/heartbeatwrapper                                                         
                    exit 127                                                                            
                fi                                                                                      
            fi                                                                                          
        fi                                                                                              
    
        sleep 1                                                                                         
    
        #now we remount our partitions                                                                  
    
        for ((test=1; test < 10; test++)); do
            mount /dev/drbd1 /export >/tmp/mountoutput 2>&1
            if [[ -n $(grep -o export /etc/mtab) ]]; then                                               
                break                                                                                   
            fi                                                                                          
        done                                                                                            
    
        if [[ $test -eq 10 ]]; then                                                                     
            rm -f /tmp/heartbeatwrapper                                                                 
            exit 125                                                                                    
        fi                                                                                              
    
    
        #I'm really unsure at this point of the side-effects of not having rpc_pipefs mounted.          
        #The issue here is that it cannot be mounted without nfs running, and we don't really want to start
        #nfs up at this point, lest it ruin everything.                                                     
        #For now, I'm leaving mine unmounted, it doesn't seem to cause any problems.                        
    
        #Now we start up nfs.
    
        /etc/init.d/nfs-kernel-server start >/dev/null 2>&1
        if [[ $? -ne 0 ]]; then
            echo "There's not really that much that I can do to debug nfs issues."
            echo "probably your configuration is broken.  I'm terminating here."
            rm -f /tmp/heartbeatwrapper
            exit 129
        fi
    
        #And that's it, done.
    
        rm -f /tmp/heartbeatwrapper
        exit 0
    
    elif [[ "x$1" == "xstatus" ]]; then
    
        #Lets check to make sure nothing is broken.
    
        #DRBD first
        /etc/ha.d/resource.d/drbddisk export status >/dev/null 2>&1
        if [[ $? -ne 0 ]]; then
            echo "stopped"
            rm -f /tmp/heartbeatwrapper
            exit 3
        fi
    
        #mounted?
        grep -q drbd /etc/mtab >/dev/null 2>&1
        if [[ $? -ne 0 ]]; then
            echo "stopped"
            rm -f /tmp/heartbeatwrapper
            exit 3
        fi
    
        #nfs running?
        /etc/init.d/nfs-kernel-server status >/dev/null 2>&1
        if [[ $? -ne 0 ]]; then
            echo "stopped"
            rm -f /tmp/heartbeatwrapper
            exit 3
        fi
    
        echo "running"
        rm -f /tmp/heartbeatwrapper
        exit 0
    fi
    
    With all of the above done, you'll then just want to configure /etc/exports
    /export 172.20.1.0/255.255.255.0(rw,sync,fsid=1,no_root_squash)

    Then it's just a case of starting up heartbeat on both machines and issuing hb_takeover on one of them. You can test that it's working by making sure the one you issued the takeover on is primary - check /proc/drbd, that the device is mounted correctly, and that you can access nfs.
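
    Concretely, that last step is something like the following sketch (the hb_takeover location varies by distribution - check /usr/lib/heartbeat or /usr/share/heartbeat):

        # On both nodes:
        /etc/init.d/heartbeat start
        # Then, on whichever node should start out as primary:
        /usr/lib/heartbeat/hb_takeover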

    --

    Best of luck man. Setting it up from the ground up was, for me, an extremely painful experience.

    Kent Fredric : There needs to be a sort of badge out there for this sort of post.
    From Tony Dodd
  • May I suggest you visit the F5 site and check out http://www.f5.com/solutions/virtualization/file/

    From jm04469
  • You can look at Mirror File System. It does file replication at the file system level, so the same file on both the primary and backup systems is live.

    http://www.linux-ha.org/RelatedTechnologies/Filesystems

    From fish.ada94
