I'm staging to a D2D 4106i appliance via a media server running CentOS 4; the D2D is mounted as an NFS share, and a file library is put on top of that. In the last few weeks, I have witnessed two occurrences of something like this:
[Normal] From: VBDA@host.domain.net "host.domain.net [/wwwhome]" Time: 28.06.2011 03:16:09 COMPLETED Disk Agent for host.domain.net:/wwwhome "host.domain.net [/wwwhome]".
[Minor] From: BMA@alucard.backup "D2D-FileLib-DB_Writer0" Time: 28.06.2011 03:16:10 [90:56] /storage-d2d-1/FileLibrary/02fea8c054e09160c505bc504bc.fd Cannot close device ( Stale NFS file handle)
Please note that the Disk Agent seems to complete normally and successfully (which also causes DP to consider the backup of this particular object a success). The file depot in question, however, appears to be broken after the event: verifying it aborts instantly, the debug log shows severe internal inconsistencies, and of course it cannot be copied.
The obvious question: could this be a known issue with either the D2D appliances (it happened both before and after the upgrade to the latest 2.2.0 firmware) or CentOS 4? AFAIK a stale NFS file handle is usually not caused by just a LAN glitch, as NFS makes painfully sure to reconnect to the server, blocking everything else if need be. The mount is hard, but interruptible.
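For reference, a hard but interruptible NFS mount of that kind would look roughly like this (hostname, export path and mount point are placeholders, and the transfer-size and transport options are illustrative choices, not values from the D2D guides):

```shell
# Hard mount: the client retries indefinitely instead of returning I/O errors;
# 'intr' allows signals to interrupt an operation hung on an unreachable server.
# All names and option values below are illustrative placeholders.
mount -t nfs -o hard,intr,tcp,rsize=32768,wsize=32768 \
    nashost:/nas/share /mnt/d2d-share
```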
I had to make the mount options up myself, as the D2D guides give no detail on which options to use beyond "mount nashost:/nas/share /somewhere". There is absolutely nothing in either the CentOS host's system logs or the D2D's...
However, I connect to our D2D via a CIFS share, or Windows share if you prefer.
I connect from a Linux box running Debian with a command like:
mount -t cifs -o username=user,password=pass //d2d_device/share_name /mnt/some_mount_point
This works fine; however, after writing for a couple of minutes or some amount of data (it is never the same), I start to get I/O errors like you do, and then I am not able to write to that share anymore unless I delete the share on the D2D and create it again...
Maybe I'm doing it wrong, but this post could be related... maybe...
(Is there any way to cross-reference other threads in the forums internally?) The difference between the case here and the case described there is just when it strikes: upon reading or upon writing.
The common factor seems to be that DP first opens a bunch of files on the NFS share (the virtual cartridges of the file library). In the case of a backup job, they are created anew for writing, and a header is written, which results in a file of size 270336 in my case (probably due to a 256 KiB block size being in use). In the case of a copy job, the already existing file is opened for reading.
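The 256 KiB guess is at least arithmetically consistent if one assumes, hypothetically, an 8 KiB header in front of a single data block (the actual depot file layout is not documented):

```python
# Hypothetical breakdown of the 270336-byte freshly created depot file:
# one 256 KiB data block plus what would then be an 8 KiB header.
BLOCK_SIZE = 256 * 1024          # 262144 bytes
FILE_SIZE = 270336               # size observed on disk
header = FILE_SIZE - BLOCK_SIZE
print(header)                    # 8192, i.e. exactly 8 KiB
```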
Now the interesting part: the disaster seems to strike only when the open file sits idle for some considerable time between creation/opening and the next real access. In my case, this can happen because the copy job opens all the source media when it starts, but then copies every session individually to tape (I don't want to multiplex on tape when the data is already demuxed on disk). Naturally, when one object takes a while to copy, all the others have to wait. The chance that the next read on an open file fails increases with time: it sometimes happens after 10-15 min, and more likely after 30-45 min.

The backup case is essentially the same whenever a longer wait occurs between creating the new file and finally starting to write data to it. For me this is only triggered by a certain file system that takes 45 min to traverse (plenty of small files), so there are always some 45 min between the file creation (and header writing) and the next write access to that file.

Now the most interesting fact: the DP media agent thinks it is writing to that file quite happily, for all the 20 GB or so it usually writes on an incremental of that file system - only when finally closing the file does it fail with the "NFS stale file handle" errno, and the file on disk is still 270336 bytes in size. I'm quite curious where all that data is buffered without either being written out or triggering a write error; there's certainly no buffer cache with that capacity on my CentOS source system (the D2D has 96 GiB of RAM, though).
The whole issue makes my D2D backups unstable enough that they haven't gone into real production use yet. I wonder why this isn't happening to more users. The CIFS issue described here may be related, but it looks quite different from the NFS one IMO. Specifically, if I had to delete the D2D share, I'd have gone insane already, as that clearly renders the whole thing unusable. For me, it just trashes a single virtual medium roughly once a week, causing some additional manual work.
Are there actually users of the new HP D2D appliances out there who make use of NAS share mode? Is NAS share mode actually tested for compatibility with all the gazillions of NFS client implementations out there? Let alone CIFS?
Thanks for the additional info. What I wanted, though, is a way to cross-reference other threads or posts internally that doesn't involve the entire URL. Using a URL in the body text for something that isn't off-site feels ugly. It also poses problems when converting to another forum backend, as seen when ITRC moved to Lithium. The hostname feels fragile (you never know whether it is just one of many and about to change tomorrow). But I see that full URLs are probably the way to go anyway.
Warming up the thread, as I'm seeing this in the D2D G2 software 2.2.15 CR announcement:
Improve NAS share backup under load: There were instances during heavy load where backups to NAS shares were slow and failed. Shares sometimes became unresponsive. This was observed mostly during SQL Backups. A fix has been implemented to improve handling of NAS traffic under load.
Anyone here in the know whether that is NFS related? The fundamental CIFS issues discussed a little upthread were fixed with 2.2.12 AFAIK, but my NFS woes are still there. It doesn't really sound like my problem, but who knows. Maybe I should try to get into that CR as a tester...