Thrashing Your SATA for Fun and Profit
N.B. This page is not intended for the really inexperienced! It may turn into "how to boil an egg on my disk?" quite quickly, and some of the damage may be irreversible.
One of the major requirements of my target is high-speed sustained writing to disks. In the process of deciding how to continue, or just getting an answer to the usual question "what have I thrown my money at?", finding the hardware and/or software limits is a must. As a rule of thumb, if I may say so, never try to measure maximum hardware performance, especially when disks are involved, while keeping any unnecessary software layer above the lowest-level drivers. For the vast majority who haven't understood what I just wrote, that translates to "forget about file systems, LVM or anything like that!".
Another important rule when estimating the raw disk speed is to go for sequential writing/reading. Jumping back and forth over the disk costs time, and that's expensive in terms of sustained transfer rate. It's also very important to feed the data fast enough that the disk logic doesn't decide to park the heads between bursts.
So I bought four 1TB Seagate Barracuda 7200.11 drives (ST31000340AS), being both extremely fast, with a sustained speed of 105MB/s, and quite reliable compared to their bigger 1.5TB brothers. Even worse, I bought two SATA port multipliers to be able to connect the disks in various ways and see how far I could get. With an ideal target of over 160MB/s, nothing was too expensive for experimenting.
As of writing this page, Seagate has already had the 7200.12 series on the market for a month, providing up to 160MB/s sustained transfer rates. But with the money already spent, and the usual delay between a device launch and its availability on the Romanian market, those drives will need a revisit sometime in the future.
"One Small Step for a Man ..."
Never go immediately for the whole structure. Sometimes (please read that as almost always) it means just wasting money and time! And these days (Feb 2009) money is the more important of the two, as time is something some of us might have plenty of, unfortunately! So I started with just one disk, to see how fast I could go with it and how truthful the Seagate datasheet is. Fortunately for them, and for me, I could almost reach the announced 105MB/s sustained transfer speed.
There are, though, some simple rules to consider:
- Always use the O_DIRECT flag when opening the whole-disk file, or you'll get a lot less speed. When you are targeting high-speed sustained recording, software caching is just a waste of time.
- Don't even bother adding the O_NONBLOCK flag and trying some fancy asynchronous approach with it, as that flag is not supported for disks and local file systems. It seems they are fast enough for 99.99% of the applications!
- Try to match the buffer size to the actual hardware capabilities. When using a single disk that translates to "keep system calls to a minimum", so use a buffer large enough to keep the SATA controller busy instead of keeping the kernel busy. With the Marvell SATA host controllers this means, so far, up to 32 SATA commands in the queue, 31 of them actually usable at any given time, and each command can transfer 64KB through the SATA host controller included in the 88F5182. So a buffer of 2MB-64KB (31 x 64KB) can be transferred with a single write call, keeping the hardware as busy as possible. The test application uses a 1MB buffer because the target application needs buffers that are a power of 2 in size. A minimal sketch of this kind of raw, direct write loop follows the list.
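To make the rules above more concrete, here is a minimal sketch (not the actual test application) of writing sequentially to a raw disk with O_DIRECT from a page-aligned 1MB buffer. The device name and the amount written are placeholders:

```c
#define _GNU_SOURCE                     // for O_DIRECT and O_LARGEFILE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1<<20)                // 1MB, a power of 2 as in the real tests

int main(void)
{ void *Buf;
  int fd;
  unsigned i;
  // O_DIRECT requires the buffer to be aligned (to at least the sector size).
  if (posix_memalign(&Buf,4096,BUF_SIZE)) return 1;
  memset(Buf,0xA5,BUF_SIZE);            // some pattern to write
  // "/dev/sdX" is a placeholder - point it at a disk whose contents you can destroy!
  if ((fd=open("/dev/sdX",O_WRONLY|O_DIRECT|O_LARGEFILE))==-1)
  { perror("open"); return 1; }
  // Plain sequential writes: no file system, no software caching, few system calls.
  for (i=0;i<1024;i++)
    if (write(fd,Buf,BUF_SIZE)!=BUF_SIZE)
    { perror("write"); break; }
  close(fd);
  free(Buf);
  return 0;
}
```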
Speed Matters
With such good results obtained using a single disk, I moved on. After installing a second disk I started looking for the best approach in software, because writing to two disks in parallel is not an easy task. The only certain thing is that the disks must be written to in turns, so while one is busy executing a set of commands, the other one is being fed with commands. I took the following steps, all of them failing but the last one, of course:
- Opening the files with the O_NONBLOCK flag - Of course, I read the lines saying that the flag is not implemented for disks and local file systems only after I had struggled to get parallel writing with such an approach. The obvious result was 105MB/s.
- Writing to the disks from separate threads, one for each disk - While sounding really clever, the results were poor. I could not decide on the real cause of such a poor final result (something like 50MB/s), but I decided to look for a better solution.
- The aioXxxx functions - With huge experience in Windows programming, I wanted a true asynchronous approach, letting the kernel do the transfer while the process is available for other work. And I found plenty of functions named aioXxxx that seemed almost perfect. Silly me, now I know that seemed was the key word back then! The results were as good, or better said "as poor", as with my multi-threaded approach. After a lot of digging I found out that those functions are implemented in user space on top of the Pthreads library and do something similar to my multi-threaded code, only in a more generic way, and a "dirtier" one if I consider my application! (A rough sketch of what this attempt looked like is shown after the list.)
- io_setup and its associated functions - These functions were really difficult to find. First of all they are Linux specific and thus not mentioned by all those *nix developers. The second impression was one of a "best kept secret"! I really had to dig to find some minimal example, and even that one was so old that I had to look into the kernel source files to see why I was getting one persistent error! In the end the code got really clean and I could see a significant speed improvement. Not that I reached the theoretically expected 210MB/s, but a decent 205MB/s was visible.
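For reference, this is roughly the shape of the aioXxxx attempt - a minimal sketch, not the code I actually ran, with placeholder device names and a placeholder amount of data. glibc services these calls with threads behind the scenes, which is why it ended up behaving much like the explicit multi-threaded version (link with -lrt on older glibc):

```c
#define _GNU_SOURCE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1<<20)
#define DISKS 2

int main(void)
{ // Placeholder device names - use disks whose contents you can destroy!
  static const char *Drives[DISKS]={"/dev/sdX","/dev/sdY"};
  struct aiocb CB[DISKS];
  const struct aiocb *List[DISKS];
  void *Buf[DISKS];
  int fd[DISKS];
  int i;
  unsigned step;
  for (i=0;i<DISKS;i++)
  { if (posix_memalign(&Buf[i],4096,BUF_SIZE)) return 1;
    memset(Buf[i],0x5A,BUF_SIZE);
    if ((fd[i]=open(Drives[i],O_WRONLY|O_DIRECT|O_LARGEFILE))==-1)
    { perror(Drives[i]); return 1; }
  }
  for (step=0;step<1024;step++)
  { // Queue one write per disk ...
    for (i=0;i<DISKS;i++)
    { memset(&CB[i],0,sizeof(CB[i]));
      CB[i].aio_fildes=fd[i];
      CB[i].aio_buf=Buf[i];
      CB[i].aio_nbytes=BUF_SIZE;
      CB[i].aio_offset=(off_t)step*BUF_SIZE;
      if (aio_write(&CB[i])) { perror("aio_write"); return 1; }
      List[i]=&CB[i];
    }
    // ... then wait for all of them to finish before the next round.
    for (i=0;i<DISKS;i++)
    { while (aio_error(&CB[i])==EINPROGRESS) aio_suspend(List,DISKS,NULL);
      if (aio_return(&CB[i])!=BUF_SIZE) { perror("aio_return"); return 1; }
    }
  }
  for (i=0;i<DISKS;i++) close(fd[i]);
  return 0;
}
```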
As a last test, I threw in both port multipliers, making sure to connect the disks in turn so that '/dev/sdb' and '/dev/sdd' end up on one of them and '/dev/sdc' and '/dev/sde' on the other. The final result was an astonishing 240MB/s when no other major disturbing activity occurred.
I did the same test on a rather fast PC, with two of the disks. It could not get faster than 160MB/s, so the little chip from Marvell is really amazing with its fully interconnected architecture.
Adding Extra Kernel Layers
Just for the completeness of this batch of tests, I decided to go one step further and build a RAID0 array out of the 4 disks. As it was pretty obvious from the very beginning that performance would be affected by the array creation parameters, I went for a set of tests to see what happens. The commands and the corresponding results are:
| Command | Speed |
| ------- | ----- |
| mdadm -C /dev/md0 -v -n 4 -l raid0 /dev/sda /dev/sdb /dev/sdc /dev/sdd | Started at 178MB/s, degraded to 176MB/s |
| mdadm -C /dev/md0 -v -n 4 -l raid0 /dev/sda /dev/sdc /dev/sdb /dev/sdd | Started at 178MB/s, degraded to 170MB/s |
| mdadm -C /dev/md0 -v -n 4 -c 1024 -l raid0 /dev/sda /dev/sdb /dev/sdc /dev/sdd | 249MB/s |
| mdadm -C /dev/md0 -v -n 4 -c 1024 -l raid0 /dev/sda /dev/sdc /dev/sdb /dev/sdd | 244MB/s |
| mdadm -C /dev/md0 -v -n 4 -c 2048 -l raid0 /dev/sda /dev/sdb /dev/sdc /dev/sdd | 247MB/s |
The first conclusion that really matters is that the RAID chunk size should properly match the available hardware. In this case a 1024KB chunk gave the best results.
The second conclusion, and quite a puzzling one I must admit, is that the disks should be listed with those on one SATA port first and those on the other SATA port after them, rather than alternating between the ports. The speed penalty is not big, but it might matter. As of kernel version 2.6.29-rc3-git10 this seems to have been fixed: thanks to a series of fixes in the md support, the behaviour is now consistent with my initial logic.
But, of course, the most annoying thing is that RAID0, properly set up, proved to be faster than my best access scheme.
"No RAID" Revisited
Based on the RAID0 tests, I decided to make a final test, accessing the disks in the '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd' order instead of the original '/dev/sda', '/dev/sdc', '/dev/sdb', '/dev/sdd' approach.
With this final test I regained confidence in my considerations - just over 252MB/s!
The 252MB/s Wall
After a lot more work I have finally understood what limits the SATA transfer speed to 252MB/s, no matter how the disks are distributed over the two port multipliers. The board being announced with 128MB of DDR memory, and after briefly checking the SoC and the memory chip datasheets, I considered that a theoretical limit of about 510-520MB/s read speed could be achieved. Totally wrong: the SoC is capable of transferring only 4 32-bit words per burst instead of the maximum of 8, so the theoretical limit is around 313MB/s. With the extra overhead of the SATA DMA and the actual operating system running, 252MB/s is quite a decent speed.
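A back-of-the-envelope sketch of why halving the burst length costs so much: every burst carries a fixed command/turnaround overhead on the memory bus, so shorter bursts pay that overhead more often per byte. The clock, bus width and overhead figures below are illustrative assumptions only, not values taken from the 88F5182 or the memory chip datasheets:

```c
#include <stdio.h>

// Illustrative only: rough usable bandwidth of a DDR bus when every burst
// costs a fixed number of "overhead" clocks. All figures are assumptions.
int main(void)
{ const double BusMHz=166.0;       // assumed DDR clock
  const double BusBytes=4.0;       // assumed 32-bit wide data bus
  const double OverheadClk=6.0;    // assumed fixed clocks spent per burst
  const double Peak=BusMHz*2.0*BusBytes;   // DDR: 2 transfers per clock -> MB/s
  int Words;
  for (Words=8;Words>=4;Words-=4)
  { double DataClk=Words/2.0;      // an N-word burst needs N/2 DDR clocks
    double Eff=Peak*DataClk/(DataClk+OverheadClk);
    printf("%d-word bursts: ~%.0f MB/s usable\n",Words,Eff);
  }
  return 0;
}
```

With these made-up figures the two cases land near the 510-520MB/s and 313MB/s estimates above, which is the whole point: halving the burst length does not simply halve the bandwidth, it makes every byte carry more of the fixed overhead.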
The Code
The code used to test all of this is shown next. 'libaio' must be installed in order to compile it, and the kernel must have asynchronous I/O support enabled.
#define _GNU_SOURCE // for O_DIRECT and O_LARGEFILE
#include <errno.h>
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> // gettimeofday()
#include <unistd.h>
////////////////////////////////////////////////////////////////////////////////
#define BufferSize 0x00100000 // 1MB per request, a power of 2 as required by the target application
#define BuffersNum ((0x40000000/BufferSize)*512) // total number of buffers to write
#define DrivesNum 4
//#define DrivesNum 1 // when testing RAID 0
static const char *Drives[]={"/dev/sda","/dev/sdb","/dev/sdc","/dev/sdd"};
//static const char *Drives[]={"/dev/md0","/dev/sdb","/dev/sdc","/dev/sdd"}; // when testing RAID 0
#define BuffsPerDrive (0x01000000/BufferSize/DrivesNum) // buffers kept in flight per drive
////////////////////////////////////////////////////////////////////////////////
int main(void)
{ void *Buf[DrivesNum][BuffsPerDrive];
int fd[DrivesNum];
size_t i,j,BufIdx;
uint64_t StartTime; // microseconds, kept 64-bit to avoid overflow
uint32_t LastTime;
uint32_t _StepsNum;
uint64_t _TotalTime;
io_context_t Ctx[DrivesNum];
struct iocb CBs[DrivesNum][BuffsPerDrive];
struct iocb *CBPs[DrivesNum][BuffsPerDrive];
struct timeval tv;
// Opens the files.
for (i=0;i<DrivesNum;i++)
if ((fd[i]=open(Drives[i],O_DIRECT|O_LARGEFILE|O_WRONLY))==-1) perror(Drives[i]);
// Allocates the buffers the writes are done from. O_DIRECT needs the buffers
// aligned to the sector size, so posix_memalign() is used instead of malloc().
for (i=0;i<DrivesNum;i++)
for (j=0;j<BuffsPerDrive;j++)
if (posix_memalign(&Buf[i][j],4096,BufferSize))
{ fprintf(stderr,"posix_memalign failed\n");return 1; }
// Creates one AIO context per drive and prepares the control block pointers.
for (i=0;i<DrivesNum;i++)
{ Ctx[i]=NULL;
j=io_setup(64,&Ctx[i]);
if (j) printf("io_setup %d - Error %d\n",(int)i,(int)j);
for (j=0;j<BuffsPerDrive;j++) CBPs[i][j]=&CBs[i][j];
}
BufIdx=0;
// Main loop: each step submits one BufferSize write to every drive. Once the
// ring of BuffsPerDrive buffers is full, wait for one completion per drive
// before reusing a buffer, so a few commands stay queued on each disk.
for (_TotalTime=0,_StepsNum=1;_StepsNum<=BuffersNum;_StepsNum++)
{ gettimeofday(&tv,NULL);
StartTime=(uint64_t)tv.tv_sec*1000000+tv.tv_usec;
for (i=0;i<DrivesNum;i++) if (fd[i]!=-1)
{ struct io_event Ev;
// Reap one completed request so its buffer slot can be reused.
if (_StepsNum>=BuffsPerDrive) io_getevents(Ctx[i],1,1,&Ev,NULL);
io_prep_pwrite(&CBs[i][BufIdx],fd[i],Buf[i][BufIdx],BufferSize,(uint64_t)_StepsNum*BufferSize);
io_submit(Ctx[i],1,&CBPs[i][BufIdx]);
}
gettimeofday(&tv,NULL);
_TotalTime+=(LastTime=(uint32_t)((uint64_t)tv.tv_sec*1000000+tv.tv_usec-StartTime));
BufIdx=(BufIdx+1)%BuffsPerDrive;
if (!(_StepsNum%10)) printf("Step %u/%u - %.02lf MB/s - %.02lf MB/s \r",
_StepsNum,BuffersNum,
(float)BufferSize*DrivesNum/(int32_t)LastTime,
(float)BufferSize*DrivesNum*(int32_t)_StepsNum/(int64_t)_TotalTime);
}
for (i=0;i<DrivesNum;i++) io_destroy(Ctx[i]);
// Frees the memory blocks.
for (i=0;i<DrivesNum;i++) for (j=0;j<BuffsPerDrive;j++) free(Buf[i][j]);
// Closes all opened files.
for (i=0;i<DrivesNum;i++) if (fd[i]!=-1) close(fd[i]);
}
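Assuming the libaio development files are installed, the program should build with something along the lines of `gcc -O2 -o sata_test sata_test.c -laio` (the file and binary names are, of course, placeholders).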