User Guidance After Migrating to the New DDN Storage System

What is this about?

Research Computing has recently acquired and installed a new HPC storage system for raad2. It was imperative to replace our old storage, an end-of-life product, in order to continue providing a production-level service to our users. Transitioning to the new storage means that we (RC staff) have replicated all data from the old system to the new one. We need you to be aware that this migration is complete, and we also need some help from you to finish the transition.

When will it happen?

All data migration HAS BEEN COMPLETED during the system downtime window from May 20th to 25th. When you log in to the system after this downtime, you will be using your home directory on the new storage system.

What benefit do I get out of the new storage?

At this time, the main benefit of the new storage to end users will be the continuance of a reliable and stable storage service. End users will also benefit indirectly from the better manageability and monitoring capabilities the system provides to system managers. We will, for instance, be able to implement directory quotas on the new system in a much better way than was possible on the old system. We will also be able to track per-job file system utilization for better IO diagnostics, and generally have access to enhanced system usage analytics.

In terms of performance and capacity, we have deliberately kept the new storage at roughly the same levels as the old system due to funding constraints in 2021, when the procurement decision was made. Looking forward to 2023, however, we do anticipate an upgrade to this new system to enhance its IO performance and capacity. This will enable the new storage to service the needs of our next HPC machine as well.

Why do I need to know?

Because you will almost certainly need to make minor changes to your existing job files in order to start using the new storage. Your job files may have explicit references to file or directory locations in your old home directory, and these must be changed. More about that is mentioned below; the example that follows shows a quick way to find affected files.
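
For instance, to locate which of your files still reference the old path, a recursive grep such as the following can help (the ~/jobs directory here is purely illustrative; point it at wherever your job files actually live):

fachaud74@raad2a:~> grep -rl '/lustre' ~/jobs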

How was the migration implemented?

The old system used to present itself to the raad2 supercomputer as the directory /lustre. All our shared software applications, user home directories, project directories, etc. resided within this /lustre file system. In order to migrate all data to the new storage, we mirrored the directory structure under /lustre onto the new storage system, but presented it to raad2 under a top-level directory called /ddn instead. In practical terms, where the absolute path of your home directory on the old storage began with /lustre/home/, your home directory on the new storage now resides under /ddn/home/ instead.

We replicated all data using the well-known rsync tool from its location under /lustre to its analogous location under /ddn.
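
Purely as an illustration (this is not necessarily the exact invocation we used), an rsync replication of this kind typically preserves file attributes and symlinks with flags along these lines:

# Illustrative sketch only: -a preserves permissions, ownership,
# timestamps, and symlinks; -H additionally preserves hard links.
rsync -aH /lustre/ /ddn/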

Why do I see files that I previously deleted in my directory once again?

Because of the way in which this migration was implemented, files that users may have deleted from their home directories on the old storage system sometime between March and May might "re-appear" in their new home directory under /ddn/home. This should not be a reason for concern, although it will be an inconvenience to identify and delete these files once again.

Even if the above doesn't apply in your case, the following general advice should still be heeded: please delete all unneeded or obsolete files from your home directories. Directory quotas of 100GB per home directory will be introduced soon. Users with higher volumes of data will be accommodated with secondary home directories where additional data may be stored. More details on this, as well as the rationale for the new scheme, will be forthcoming. In the meantime, it will help everyone if data that is no longer needed is removed from the system.

Here is one way to create a listing of large files (in this case, larger than 1GB) in your home directory to assist in this endeavor.

fachaud74@raad2b:~> find . -size +1G -exec ls -lh {} \;
-rw-r--r-- 1 fachaud74 rc.users 8.0G Jun  1  2018 ./misimoe83/bus_errors/#md2101w.trr.3#
-rw-r--r-- 1 fachaud74 rc.users 8.0G Jun  6  2018 ./misimoe83/bus_errors/md2101w.trr
-rw-r--r-- 1 fachaud74 rc.users 3.7G May 13  2018 ./n3.8.1/WRFV3/test/em_real/june4_6/no_d_sst/wrfout_d01_2016-06-04_00:00:00
-rw-r--r-- 1 fachaud74 rc.users 3.1G May 13  2018 ./n3.8.1/WRFV3/test/em_real/june4_6/no_d_sst/wrfout_d02_2016-06-04_00:00:00
-rw-r--r-- 1 fachaud74 rc.users 2.4G May 17  2018 ./n3.8.1/WRFV3/test/em_real/wrfout_d01_2016-06-04_00:00:00
-rw-r--r-- 1 fachaud74 rc.users 3.1G May 13  2018 ./n3.8.1/WRFV3/test/em_real/wrfout_d02_2016-06-04_00:00:00
-rw-r--r-- 1 fachaud74 rc.users 3.5G May 13  2018 ./n3.8.1/WRFV3/test/em_real/wrfout_d02_2016-04-11_18:00:00
-rw-r--r-- 1 fachaud74 rc.users 4.1G May 13  2018 ./n3.8.1/WRFV3/test/em_real/wrfout_d01_2016-04-11_18:00:00
-rw-r--r-- 1 fachaud74 rc.users 2.3G Aug 30  2021 ./FreeFEM++/freefem-docker.tar
-rw-r--r-- 1 fachaud74 rc.users 1.3G Sep 26  2017 ./muhamza47/silica1nm/md.trr
-rwxr-xr-x 1 fachaud74 rc.users 2.2G May 27  2018 ./kokakos14/32bit_compile/CALPUFF_v7.2.1_L150618/calpuff_R5
-rw-r--r-- 1 fachaud74 rc.users 2.2G May 27  2018 ./kokakos14/32bit_compile/CALPUFF_v7.2.1_L150618/calpuff.o
-rw-r--r-- 1 fachaud74 rc.users 3.4G Apr 25  2018 ./cle6_upgrade/sleupdate-12sp3%2b180209-201802091328.iso
-rw-r--r-- 1 fachaud74 rc.users 2.1G Apr 25  2018 ./cle6_upgrade/SLE-12-SP3-WE-DVD-x86_64-GM-DVD1.iso
-rw-r--r-- 1 fachaud74 rc.users 3.6G Apr 25  2018 ./cle6_upgrade/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso
-rw-r--r-- 1 fachaud74 rc.users 5.4G Apr 25  2018 ./cle6_upgrade/cle-6.0.6237-201802172355.iso
-rw-r--r-- 1 fachaud74 rc.users 1.5G Apr 25  2018 ./cle6_upgrade/SLE-12-SP3-SDK-DVD-x86_64-GM-DVD1.iso
-rw-r--r-- 1 fachaud74 rc.users 4.2G Apr 25  2018 ./cle6_upgrade/CentOS-6.5-x86_64-bin-DVD1.iso
-rw-r--r-- 1 fachaud74 rc.users 1.6G Apr 25  2018 ./cle6_upgrade/SLE-12-Modules-v3.iso
-rw-r--r-- 1 fachaud74 rc.users 3.7G Apr 25  2018 ./cle6_upgrade/Cray-slebase-12-SP3-201709141039.iso
-rw-r--r-- 1 fachaud74 rc.users 1.7G Aug 31  2021 ./ansys/R193/ANSYS2019R3_LINX64_Disk1/tp/LINX64.GZ
-rw-r--r-- 1 fachaud74 rc.users 1.5G Aug 31  2021 ./ansys/R193/ANSYS2019R3_LINX64_Disk1/ensight/LINX64.TGZ
-rw-r--r-- 1 fachaud74 rc.users 1.1G Aug 31  2021 ./ansys/R193/ANSYS2019R3_LINX64_Disk2/common/LINX64.GZ
-rw-r--r-- 1 fachaud74 rc.users 9.5G May 13  2018 ./wrf-381.tar.gz
-rw-r--r-- 1 fachaud74 rc.users 1.2G Jul  5  2016 ./OpenFOAM-media/ThirdParty-v1606+.tar
-rwxr-xr-x 1 fachaud74 rc.users 1.1G Apr 23  2019 ./mpiapp.simg
fachaud74@raad2b:~>
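
To see which subdirectories account for the most space overall, a per-directory summary can complement the listing above (a sketch; the sort and head stages simply rank the twenty largest entries):

fachaud74@raad2b:~> du -sh ./* | sort -rh | head -20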

What must I do now that the data migration is complete?

Confirm location of new home directory

First, confirm that your home directory has now been changed to /ddn/home/username.

fachaud74@raad2a:~> pwd
/ddn/home/fachaud74
fachaud74@raad2a:~>

Confirm modulefiles are sourced from /ddn

Next, check that your modulefiles are now being sourced from the /ddn/sw/xc40/cle7/modulefiles directory. Simply check the availability of one of the TAMUQ-installed modulefiles (e.g. ansys):

fachaud74@raad2a:~> module avail ansys

------------------------------------------------- /ddn/sw/xc40/cle7/modulefiles -------------------------------------------------
ansys/195
fachaud74@raad2a:~>

Check .bashrc for references to /lustre

Next, if you have ever modified your .bashrc file, check its contents to ensure it contains no references to the /lustre filesystem. If it does, replace every instance of the string /lustre with the string /ddn. Contact RC staff for help if you do see instances of /lustre in that file but are hesitant, for whatever reason, to make these changes yourself.
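
For example (a sketch; sed's -i.bak option, standard in GNU sed, edits the file in place while keeping a backup copy named .bashrc.bak):

fachaud74@raad2a:~> grep '/lustre' ~/.bashrc
fachaud74@raad2a:~> sed -i.bak 's|/lustre|/ddn|g' ~/.bashrc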

Fix erroneous symbolic links still pointing to /lustre

Finally, it is possible that the data migration process did not handle symbolic links found within your home directories as expected. A symbolic link (also called a symlink) is a type of file in Linux that points to another file or folder on the same system; symlinks are similar to shortcuts in Windows. You may have had symlinks within your directory contents that pointed to a file or folder under the /lustre filesystem. The symlink itself would indeed have been copied by the migration process to its new location under /ddn/home/username, but it may still be pointing to something under the /lustre filesystem. We need a way to discover all such symlinks, if they exist, and make them point to the same objects as before, but under the /ddn filesystem. RC has written a script to help you with this process, and the sequence of commands below employs that script.
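
To inspect any individual symlink, readlink prints its target; the link name and target below are hypothetical:

fachaud74@raad2a:~> readlink results_link
/lustre/home/fachaud74/project1/results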

Make sure you are at the top level of your home directory. Find all symlinks still pointing to something under /lustre and save a list of these to the file bad_symlinks.txt.

fachaud74@raad2a:~> cd
fachaud74@raad2a:~> find . -type l -exec ls -l '{}' \; | egrep ' /lustre' > bad_symlinks.txt

Copy the fixSingleLink.sh script from the specified location into your directory.

fachaud74@raad2a:~> cp /ddn/share/tools/fixSingleLink.sh .

Create a file containing a list of commands that use the script to fix each of the bad symlinks in your list. Save this list of commands to a file called myFixes.sh.

fachaud74@raad2a:~> awk '{print "./fixSingleLink.sh  " $9 "  " $11 " >> linkFixes.log"}' bad_symlinks.txt > myFixes.sh
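
Each line of the generated myFixes.sh should then look something like the following (the link path and target shown are illustrative):

./fixSingleLink.sh  ./project1/results_link  /lustre/home/fachaud74/project1/results >> linkFixes.log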

Run the list of commands you just created.

fachaud74@raad2a:~> chmod u+x myFixes.sh
fachaud74@raad2a:~> ./myFixes.sh

The previous commands should have saved information about the fixed links to a log file, which we can now use to filter out links that might still be bad even after the attempted fixes. This could happen, for instance, in cases where the new target location does not exist under /ddn. Such links were likely already broken or problematic even before the migration. If you are concerned about them, you can seek further assistance from RC staff.

fachaud74@raad2a:~> grep "WARN: " linkFixes.log | awk '{print $4}' > still_bad_symlinks.txt

It is a good idea to keep the files related to the symlink fixes around (for a while at least) for later examination and troubleshooting.

fachaud74@raad2a:~> mkdir fixed_links
fachaud74@raad2a:~> mv fixSingleLink.sh fixed_links/
fachaud74@raad2a:~> mv myFixes.sh fixed_links/
fachaud74@raad2a:~> mv linkFixes.log fixed_links/
fachaud74@raad2a:~> mv bad_symlinks.txt fixed_links/
fachaud74@raad2a:~> mv still_bad_symlinks.txt fixed_links/

Modify your slurm job files for any new jobs

Finally, be sure to make the appropriate changes in all your slurm job files before submitting new jobs. All file and directory paths containing the string /lustre should be replaced with the string /ddn. For instance, one may find the following few lines in an existing job file:

# Create scratch directory for job
export MYSCRATCH=/lustre/scratch/${SLURM_JOB_ID}_$USER
mkdir -m 700 $MYSCRATCH

# Set Gaussian Root Path
g09root="/lustre/sw/xc40/cle7/gaussian/g09.d01"

In order to use such a file to correctly launch a job after the storage system transition, these lines will need to change to:

# Create scratch directory for job
export MYSCRATCH=/ddn/scratch/${SLURM_JOB_ID}_$USER
mkdir -m 700 $MYSCRATCH

# Set Gaussian Root Path
g09root="/ddn/sw/xc40/cle7/gaussian/g09.d01"

Note that because the bulk of software application locations are initialized by the loading of modules (e.g. module load matlab), we have already made the appropriate modifications to our modulefiles so that this issue remains transparent to end users. In some instances, though, users may have specified the locations of files or applications using absolute pathnames in their slurm job files (as in the example above). In such cases they will need to make the kinds of modifications shown above; the sketch below can automate this across a batch of job files.
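
If you have many job files to update, a sed one-liner can apply the substitution in bulk. The following is a sketch; the ~/jobs directory and the *.slurm file pattern are assumptions to adapt to your own layout, and -i.bak keeps a backup copy of each original file:

fachaud74@raad2a:~> cd ~/jobs
fachaud74@raad2a:~/jobs> sed -i.bak 's|/lustre|/ddn|g' *.slurm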