Dear Avid, ACSRs and Avid shared storage users,
I'm going to describe a scenario I have just encountered that makes me wonder if I'm a massive idiot or if all these years 'we' missed a huge flaw in Avid's media indexer functionality.
First the simplified facts:
- The ISIS 7000 suffers from an electronic component problem that can cause the ISB microcontrollers to fail to boot if power is removed from an ISIS 7000 chassis. Avid will replace these controllers free of charge regardless of support contract status, as this is considered a manufacturing fault. Kudos Avid! (I do however wonder if the policy for replacing these boards should be changed from 'when failed' to 'known to fail', but I can't estimate the costs.)
- As probably everybody knows, the integrated media indexer on all Media Composer systems (standalone, using shared storage, or in an Interplay environment) continuously monitors the file count of the local storage Avid MediaFiles folder and scans the storage if a mismatch is detected (or if the pmr and mdb files are not present). If it detects a corrupt mxf/omf file during the scan, it moves that file into the quarantine folder. This scanning process is a visible process.
- The integrated media indexer on Media Composer systems with shared storage will do the same for the Avid MediaFiles\MXF\machinename.number folders in the ISIS/Nexis workspaces if no Interplay components are installed. This is also a visible process.
- In an Interplay environment the central media indexer takes care of all workspace indexing. It does not initiate an automatic re-index or re-scan when changes occur on workspaces. However, it has a feature called the Media Indexer Full Resync Time, which by default is set to 01:00 at night. This process is only visible on the media indexer web page.
- Media Indexer relies on ISIS/Nexis file system notifications to detect new files being created on the Avid shared storage. These notifications come in three flavours: CREATED, UPDATED and DELETED. When a file is created or updated, Media Indexer starts an index process for the file, provided the file is in a non-excluded folder. By default Media Indexer ignores files in folders named "temp", "creating" and "quarantined files"; files created in these folders are detected but ignored.
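The notification rule in the last point can be sketched in a few lines of Python. This is only an illustration of the behaviour as described, not Avid's implementation: the function name `should_index`, the case-insensitive folder match, and the assumption that a DELETED notification never starts an index run are all hypothetical.

```python
from pathlib import PureWindowsPath

# Folder names as listed above; case-insensitive matching is an assumption.
EXCLUDED_FOLDERS = {"temp", "creating", "quarantined files"}


def should_index(event_type: str, file_path: str) -> bool:
    """Return True if this notification would start an index run (hypothetical helper)."""
    # Only CREATED and UPDATED trigger indexing; DELETED is assumed not to.
    if event_type not in ("CREATED", "UPDATED"):
        return False
    # The file is detected, but ignored if any parent folder is on the exclusion list.
    folders = PureWindowsPath(file_path).parent.parts
    return not any(part.lower() in EXCLUDED_FOLDERS for part in folders)
```

Note that under this rule a file sitting in a "quarantined files" folder stays invisible to the indexer until something moves it back out.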
The scenario:
An unlucky customer encounters an issue with his UPS and power fails at around 00:00 at night. PANIC!!! UPS techs, electricians and broadcast techs are called and those still on site are running around. At 00:15, slowly, slowly, power is restored to all equipment. Many will recognize this situation: PSUs that have not been powered down for years failing, smoking capacitors, clocks needing setting, drives failing and RAID sets (attempting) rebuilding and... ISIS 7x00 ISB microcontrollers failing. Disaster...
Before 01:00 all Avid servers are powered up and the system directors present the file system to all clients, including the media indexers. But due to the multiple ISB failures most media is missing file parts. But no panic, the Avid storage support team is here to help 24/7 to get those failed ISBs up and running a.s.a.p. However...
At 01:00 the media indexers perform the automated re-index and quarantine all media they consider corrupt. So they move the files in their corrupt state. When the new ISB microcontroller boards arrive after 24-48h, all ISBs are brought back up; 7 ISBs, however, remain in the unknown state. All 7 of these ISBs report one or both drives missing. 6 of them function normally after a reseat; the remaining ISB does not play ball. The power cut has killed one of its drives, but no panic: hot spare ISB(s) are available and the system was 'only' 90% full. Of course the ISBs perform their 'rectifying files' when reinserted and a 'repairing mirrors' because of the failed ISBs. To replace the failed ISB the spare is added and a redistribution starts. In the morning the redistribution finishes, but 90% of the media is in quarantine and, indeed, trying several mxf files shows they are corrupt.
Could this be prevented? Yes, of course, but is this procedure documented? Was it the quarantine action that killed the system director's ability to recover from this catastrophe, or was it the repairing mirrors or the redistribution? All projects and other files outside of the Avid MediaFiles folders are 100% ok.
But this made me think (yes, I know that's a bad idea). Why does the media indexer need to move corrupt media, which is the last thing you want on any storage facing problems? For the local media indexer... let's skip that chapter for now.
In an Interplay environment, why can't the media indexer just flag the mxf files as corrupt without touching them? Why not allow a manual retry after the user has been able to restore his ISIS to 100%?
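To show what such "flag, don't move" behaviour could look like, here is a minimal Python sketch. Everything in it is hypothetical: the class `FlagOnlyIndex`, the `is_readable` integrity check and the `retry_flagged` method are names made up for illustration and say nothing about how Media Indexer is actually built.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class FlagOnlyIndex:
    # Hypothetical integrity check: returns True if the file reads back completely.
    is_readable: Callable[[str], bool]
    # path -> reason; flagged files stay exactly where they are on the workspace.
    flagged: Dict[str, str] = field(default_factory=dict)

    def scan(self, paths: Iterable[str]) -> None:
        """Check each file and record a flag instead of moving anything."""
        for path in paths:
            if self.is_readable(path):
                self.flagged.pop(path, None)  # clear a flag left by an earlier scan
            else:
                self.flagged[path] = "unreadable at scan time"

    def retry_flagged(self) -> List[str]:
        """Manual retry: re-check only the flagged files, return those still failing."""
        self.scan(list(self.flagged))
        return sorted(self.flagged)
```

The point of the design is that a scan during degraded storage only writes to the index's own bookkeeping. Once the storage is back at 100%, one call to `retry_flagged()` clears every flag whose file reads correctly again, with no file having been touched in between.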
The automatic re-index is only documented in the Best Practices documentation. There is no option to disable it, only to change the time at which it happens.
As an ISIS and Interplay ACSR I believe I should have known the above and how to avoid it, or I should have been able to deduce how to prevent such a catastrophe by putting the functional knowledge of the individual parts together. But even if I did, if power is restored to the system directors and media indexers 5-10 minutes before the automatic re-index is initiated... is that a case of bad luck?