Use server resources efficiently

Now that more people are using the biomisc server, here are some common rules for using it:

Central storage
Many large genomic files are needed by all of us and used repeatedly, such as genome sequence files, gene annotations, index files for bowtie or BWA, and even public datasets from ENCODE. They are currently stored in central locations: /data/db and /data/encode. Check there first so that you don’t waste time downloading files that already exist or waste space storing duplicates. Make symbolic links instead of copying the files into your own working directory. Try your best to keep only one copy of each file on the server. If you have files that you think other people may need in the future, create a folder under /data/db and put them there.
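As a minimal sketch of the linking advice above, the following Python helper creates a symbolic link in a working directory instead of copying a shared file. The function name `link_shared_file` and any file names you pass to it are hypothetical and only for illustration; the same thing can be done by hand with `ln -s`.

```python
import os
from pathlib import Path


def link_shared_file(shared_file, working_dir):
    """Create a symbolic link in working_dir that points at shared_file.

    Hypothetical helper: nothing is copied, so the file keeps a single
    copy on the server.
    """
    shared_file = Path(shared_file).resolve()
    link_path = Path(working_dir) / shared_file.name
    # Leave anything that already exists under that name untouched.
    if link_path.is_symlink() or link_path.exists():
        return link_path
    os.symlink(shared_file, link_path)
    return link_path
```

Calling it with a file under /data/db and your own project directory leaves a link named after the shared file in that directory; programs can open the link as if it were the file itself.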

Working directory
We now have two hard disks, /data and /data2. You can create your own working directory on either one if there is enough space left. Make sure you set the permissions of your files so that the group (tliu4) can read them. A good habit is to separate your projects into individual directories, and each job within a project into its own sub-directory; otherwise it is easy to mess things up. Keep raw files, processed files, script templates and so on in different directories. Make as many symbolic links as possible to save space, but remember NOT to ‘follow links’ when copying or archiving your files, or each link will be replaced by a full copy of the file it points to.
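The two housekeeping steps in this section, making files group-readable and archiving without following links, could be sketched in Python as below. The function names are hypothetical. The sketch assumes your files already belong to the right group; it only adds read permission (and execute permission on directories, so they can be entered) and never removes any.

```python
import os
import stat
import tarfile


def make_group_readable(top):
    """Add group read permission to everything under top."""
    paths = [str(top)]
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(top):
        paths += [os.path.join(dirpath, name) for name in dirnames + filenames]
    for path in paths:
        if os.path.islink(path):
            continue  # permissions are decided by the link target
        extra = stat.S_IRGRP
        if os.path.isdir(path):
            extra |= stat.S_IXGRP  # directories also need execute to be entered
        os.chmod(path, os.stat(path).st_mode | extra)


def archive_without_following_links(top, archive_path):
    """Write a .tar.gz of top in which symbolic links stay links."""
    # dereference=False (the default) stores each link itself,
    # not a copy of the file it points to.
    with tarfile.open(archive_path, "w:gz", dereference=False) as tar:
        tar.add(top, arcname=os.path.basename(os.path.normpath(top)))
```

Because the links are stored as links, the archive stays small, but the linked files are only available where the link targets still exist.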

Compress or remove old files, or make your program able to read compressed files directly
No matter how big the hard disk is, without good practice it will quickly fill up with junk. The most important rule is to remove big files you no longer use, and to compress those files that you think: a) would take a significantly long time to reproduce; b) will be used in the near future. Many of your files are plain text. In that case, a simple gzip can make the file roughly 70% smaller. If you program in Python, the standard library has a nice module called ‘gzip’ that reads gzip files directly.
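A minimal sketch of what this looks like with Python's standard `gzip` module follows. The helper names `compress_file` and `count_lines` are hypothetical. Note that, unlike the `gzip` command, this `compress_file` leaves the original file in place, so you would still remove it yourself once you have checked the compressed copy.

```python
import gzip
import shutil


def compress_file(path):
    """Write a gzip-compressed copy of path next to it and return its name."""
    gz_path = str(path) + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks; never loads the whole file
    return gz_path


def count_lines(gz_path):
    """Read a gzipped text file line by line without decompressing it on disk."""
    # Mode "rt" makes gzip.open yield decoded text lines, like a normal open().
    with gzip.open(gz_path, "rt") as handle:
        return sum(1 for _ in handle)
```

Any script that loops over the lines of a text file can be adapted the same way, by replacing `open(path)` with `gzip.open(path, "rt")`.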

Use the right format
Bioinformatics involves working with many different formats. It’s recommended that you spend some time learning the most efficient format for storing your data. For example: to store sequencing data, fastq is THE format; to store alignment results, BAM is significantly smaller than SAM; to store signal profiles along the whole genome, bedGraph is the right plain text format (consecutive positions with the same value are merged into one line), and bigWig, its binary counterpart, is much smaller still. There are plenty of existing tools that can process these files even when they are in binary format. For example, samtools and bedtools can operate on BAM, and the UCSC tools (http://hgdownload.cse.ucsc.edu/admin/exe/) can operate on bigWig files.
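To illustrate why bedGraph is compact, here is a small Python sketch of the merging idea: adjacent intervals on the same chromosome that carry the same value collapse into a single line. The function name `merge_equal_runs` is hypothetical, and the sketch assumes the input is already sorted by chromosome and start position and uses bedGraph's zero-based, half-open coordinates.

```python
def merge_equal_runs(intervals):
    """Merge adjacent (chrom, start, end, value) intervals with equal values.

    Assumes intervals are sorted by chromosome and start position.
    """
    merged = []
    for chrom, start, end, value in intervals:
        if merged:
            prev_chrom, prev_start, prev_end, prev_value = merged[-1]
            # Extend the previous interval only if this one touches it
            # on the same chromosome and has the same value.
            if prev_chrom == chrom and prev_end == start and prev_value == value:
                merged[-1] = (chrom, prev_start, end, value)
                continue
        merged.append((chrom, start, end, value))
    return merged


# Four per-base records collapse into two bedGraph lines.
per_base = [
    ("chr1", 0, 1, 1.0),
    ("chr1", 1, 2, 1.0),
    ("chr1", 2, 3, 1.0),
    ("chr1", 3, 4, 4.0),
]
for chrom, start, end, value in merge_equal_runs(per_base):
    print(chrom, start, end, value, sep="\t")
```

Long stretches with no change in signal, which are common in genome-wide profiles, each cost only one line in this representation.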