Big internet pipe and the Cloud saved my storage in a crisis
The current storage solutions at AgResearch are all based on Network Attached Storage (NAS) technologies. They were simple, quick and cost-effective to deploy, and in some instances even easy to scale up in capacity. However, the individual fileservers have become data silos, and we suffer from their limitations regularly. This talk is based on an incident caused by one of those limitations. It also covers how we recovered from it quickly by utilising the Cloud, and our thoughts on our future storage platform.
Over one weekend in early October 2019, an unexpected amount of data was placed on one of our user-accessible fileservers, pushing its utilisation over 85%. Consequently, its performance started to degrade. Unfortunately, no other storage in the same physical location had enough spare capacity to absorb this additional load.
We decided to reclaim capacity quickly by removing some large datasets that had not been accessed by users for over two years. At the same time, we had to maintain the same level of data protection (two separate copies of the same data stored in two different locations). To achieve this, we uploaded the offsite replicas of those datasets to Microsoft Azure Blob storage before removing the original copies from the server. We also configured the Cloud storage to automatically migrate data from the Cool tier to the Archive tier after it had been in the cloud for 7 days. This significantly reduces the cost of storing data in the Cloud for the long term, although we acknowledge the additional cost and time of retrieving such data if that is ever required. We deem the probability of such an operation low: it would only be necessary in a disaster recovery scenario.
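The tiering described above can be expressed as an Azure Blob Storage lifecycle management policy. The sketch below is illustrative, not our production configuration (the rule name is our own invention); it moves block blobs to the Archive tier 7 days after their last modification.

```json
{
  "rules": [
    {
      "name": "archive-after-7-days",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": ["blockBlob"] },
        "actions": {
          "baseBlob": {
            "tierToArchive": { "daysAfterModificationGreaterThan": 7 }
          }
        }
      }
    }
  ]
}
```

A policy like this can be applied to a storage account with the Azure portal or CLI, after which tiering happens automatically with no further operator involvement.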
We were extremely pleased with the performance of REANNZ's network while uploading data to Microsoft Azure's Australian region. We uploaded 2 TB of data in just over 37 minutes, which translates to an average of roughly 7 Gbps on our 10 Gbps WAN. It took another 2 hours to remove the dataset from the fileserver that was running out of capacity. Overall, it took just under 3 hours to stabilise the fileserver, which we consider a good outcome. After the initial crisis was over, we uploaded a further 6 TB of data to the Cloud to reclaim more capacity from the same fileserver. In the short term, we plan to use the same approach whenever we encounter similar issues, until we are able to replace our current generation of storage solutions.
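As a sanity check on the figures above, the quoted rate follows directly from the transfer size and time (assuming decimal terabytes, i.e. 1 TB = 10^12 bytes):

```python
# Back-of-envelope check: 2 TB transferred in 37 minutes.
data_bits = 2 * 10**12 * 8   # 2 TB expressed in bits
seconds = 37 * 60            # 37 minutes in seconds
gbps = data_bits / seconds / 10**9

print(round(gbps, 1))        # -> 7.2, consistent with ~7 Gbps on a 10 Gbps WAN
```

In other words, the transfer sustained roughly 70% of the WAN's line rate end to end, which is why the upload finished well within the maintenance window.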
Almost all of our storage solutions will reach their end of life in the next 12 to 24 months, and we are currently planning a new generation of storage platform to replace them. From the lessons we have learned to date, we think a scale-out storage solution is much more fit for purpose than NAS appliances or fileservers. Based on our use of the Cloud, we are starting to see the value of object stores, although we won't be getting rid of filesystems, our unstructured data stores, any time soon; our ambition is to integrate the two with some smart software. We also think data replication is more practical and appropriate than the traditional backup/restore model for the volume of data we have to keep. Lastly, the ability to replicate data to the Cloud is attractive, particularly to low-cost archival storage, but its high retrieval overhead (in both time and cost) is a risk that needs to be further investigated and mitigated.
ABOUT THE AUTHOR
Dan is currently working for AgResearch as an HPC consultant and maintains a smallish Linux cluster and its storage. He is passionate about helping researchers do science by using advanced technologies. When he is not firefighting at work, he enjoys barista-made coffee, fancy burgers and donuts with his collaborators and friends.