Common FAS Performance Myths

May 17, 2016

This is another one of those topics that crops up in the community every few weeks.  This time around, a customer was nice enough to phrase the myths as two tidy bullet points.


  • When is NetApp going to start supporting 25/40/50GbE NICs instead of just 10GbE?
  • It is likely FAS couldn’t drive interfaces that large anyway, due to NVMEM bottlenecks for writes and spindle-count and aggregate-capacity limits for reads… FAS isn’t a high-bandwidth storage device.


Performance for any product should never be measured by the speed of its fabric interfaces.  In practice, almost no storage-to-client connection is able to saturate a modern interface of any type, regardless of protocol choice.

As an example, 10GbE was released as a standard back in 2002 as a connectivity option and was originally used for ISLs and aggregation.  The reason was simple: back then the average enterprise application pushed around 2.5Mb/s, a fraction of a gigabit interface.  10GbE became viable as primary host connectivity around 2012, and by that time application throughput averages were only getting into the 15Mb/s range, finally outgrowing 1GbE speeds (a standard released back in 1998).  Applications are almost a decade behind fabric development for most workloads.  Some will argue InfiniBand and HPC workloads, and those are the exception.

A good example is the SPC benchmark series.  SPC2 is ranked on throughput vs response time (latency), and the throughput value is really there to provide a load curve for latency.  SPC is a block-based test and is almost always run over FCP.  If we take a look at our own postings for both FAS (http://tinyurl.com/kuv6yx6) and E-Series (http://tinyurl.com/h4cjeu3), we can see a posted result of 11352MB/s, which is roughly 90Gb/s.  This is done over 8 paths of 16Gb and only results in ~60% load on a port.  Per path, that is just a hair past what a single 10Gb port can do, and this is a system under maximum throttle.  What’s more important is the response time under that load, which even at maximum was only 760us (microseconds).  As a comparison, the highest SPC2 result clocks in at 70120MB/s, an impressive 560Gb/s, but at 66ms of latency, which makes it nonviable for most applications.  It takes 32 devices and 64 ports to get there, which makes it cost prohibitive (4M USD), AND it still only puts port utilization at 55%, right around where we are sitting at roughly 1% of the latency.
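To put those figures in perspective, here is a quick back-of-the-envelope check in Python using only the numbers quoted above.  It assumes 1MB = 10^6 bytes and ignores FC protocol overhead, so treat it as a rough sketch rather than an official calculation:

```python
# Rough sanity check of the SPC2 throughput figures quoted above.
# Assumption: 1 MB = 10^6 bytes; FC framing/protocol overhead is ignored.

mb_per_s = 11352                          # posted SPC2 throughput, MB/s
paths = 8                                 # 8 x 16Gb FC paths

total_gbps = mb_per_s * 8 / 1000          # megabytes/s -> gigabits/s
per_port_gbps = total_gbps / paths

print(f"Total:    {total_gbps:.1f} Gb/s")     # ~90.8 Gb/s
print(f"Per port: {per_port_gbps:.1f} Gb/s")  # ~11.4 Gb/s, just past a 10Gb port
```

The point is not the exact percentages; it is that even a system being pushed as hard as it will go barely outruns a single 10Gb port on a per-path basis.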


Throughput alone is a bad metric; taken by itself it is almost the worst metric you can use.


That being said, technology marches forward and the cost of 40Gb is coming down, so we’re introducing it into our systems later this calendar year.  For something more specific, ask your friendly local NetApp team under NDA.

The follow-up to question one is almost always question two, and it is born of FUD and a misunderstanding of how a FAS really works.  Some of this is our fault (yes, I am part of the crowd that added to the confusion).  The myth is that all NetApp writes go to NVRAM/NVMEM.


This is a myth.  It’s almost a lie.  It’s also the easiest way to describe the truth.


When a write comes into a FAS it really goes into main memory.  What NVRAM gets is a log of the write and the parity calculation.  Once we have the parity calculation we tell the client we have the data, and then we figure out how we’re going to get that write to disk along with the rest of the writes we have stored.  The process is further muddled by where the data actually resides as it moves around during the tetris build process.  Needless to say, this makes for a complex explanation of why NetApp writes are fast.  It was easier to just say they go to NVRAM and call that good enough (I used to say the same thing, so I helped build this misconception, go ME!).

Less scrupulous people took that statement as “you’re writing to PCI or secondary memory, which is slower than writing to direct memory, AND you create a bottleneck since this goes down a PCI path.”  They would be almost right if we sent the writes themselves down a PCI path.  Since FAS only logs the write and its parity, PCI bandwidth is irrelevant.  Since the parity calculations happen in cache at the CPU level, writes are fast.  Stupidly fast.  The only time writes really become a concern is when they cannot be flushed to disk fast enough.  Then FAS starts to run into an issue, as memory is not released quickly enough for new writes, and the system has to start delaying additional writes to give the ones in flight time to flush.  Normally this is caused by a lack of available spindles OR not enough headroom on the node.  It’s rarely a matter of raw throughput, and it’s almost never the port unless you have a poorly designed fabric.
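For readers who think better in code, here is a toy model of the pattern described above: buffer the write in memory, journal a small log record, acknowledge the client, and destage to disk later, delaying new writes only when the flush falls behind.  This is a generic write-back-cache sketch with made-up names and thresholds, not ONTAP source code:

```python
# Toy model of write-back caching with a small NV log.
# Generic illustration only -- names, limits, and structure are invented
# for this sketch and are NOT how ONTAP is actually implemented.

from collections import deque

class ToyWriteCache:
    def __init__(self, max_dirty=1000):
        self.dirty = deque()            # writes buffered in main memory, not yet on disk
        self.nv_log = []                # small journal records (not the data payload)
        self.max_dirty = max_dirty      # back-pressure threshold

    def write(self, block_id, data):
        if len(self.dirty) >= self.max_dirty:
            # Destaging can't keep up: new writes get delayed until memory frees up.
            return "DELAYED"
        self.dirty.append((block_id, data))      # data lands in memory
        self.nv_log.append(("wrote", block_id))  # tiny log entry, not the full write path
        return "ACK"                             # client is acknowledged before any disk I/O

    def destage(self, batch=100):
        # Background flush: push a batch of dirty blocks to disk, retire log entries.
        n = min(batch, len(self.dirty))
        for _ in range(n):
            self.dirty.popleft()                 # pretend this block just hit disk
        del self.nv_log[:n]
```

The only thing the sketch is meant to show is the ordering: the acknowledgement depends on memory and the log, not on disk, and the client only feels anything when destaging falls behind and new writes have to wait.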

Since this all stems from the belief that FAS is somehow not a high-performing storage system, let’s focus on its SPC1 result (http://tinyurl.com/zh7xkod).  Let’s get the naysayers out of the way right up front: the throughput measured during the SPC1 run was 5622MB/s, which is 45Gb/s, half of the SPC2 result.  That kills the misconception that a FAS can’t do 10Gb of throughput right there.  SPC1, however, is not about throughput; it’s about IO, and like SPC2 it ranks that metric against latency.

IO is the measurement of input and output requests made in a second, aka operations.  In its most basic form it’s how much stuff gets done vs time.  On its own that metric is actually the worst performance metric there is, since two-thirds of the equation is missing: throughput = IOPS × IO size, and both of those only mean something at a given latency.  If we only know the IO count but not how big each IO is, it’s worthless.  If we know how big the IO was but not how long the operation took, it is equally worthless.  Not all IO is created equal.  If marketing is touting a number as a selling point, it’s probably suspect.  Enter the SPC1 results, which create a level playing field since the test parameters are the same for everyone: a blended IO load that stresses a roughly 60% write workload over block sizes varying from 4k to 128k.
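Here is a tiny illustration of why an IOPS figure without a size (and a latency) is meaningless.  The IO sizes below are arbitrary picks from the 4k-128k range mentioned above, not the actual SPC1 mix:

```python
# throughput = IOPS x IO size -- the same op count implies very different bandwidth.
# The sizes below are hypothetical; the real SPC1 workload is a blended mix.

iops = 685_281                        # ops/second (the FAS SPC1 figure discussed later)

for size_kib in (4, 8, 64, 128):
    mb_per_s = iops * size_kib * 1024 / 1e6
    print(f"{size_kib:>3} KiB IOs -> {mb_per_s:>9,.0f} MB/s")
```

Same IOPS, anywhere from a few GB/s to nearly 90GB/s of implied bandwidth, which is exactly why the number on its own tells you nothing.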

The two key performance metrics from the SPC1 result:

IOPS vs Response Time – how much work was done, and how long did it take to complete?

685,281 @ 1.23ms

As a benchmark, most OLTP applications like sub-10ms latency, and virtualization likes sub-30ms.  This result is 1/20…

Sustainability – was the response time consistent across the whole test?

I think a picture does better here than a 400-page Excel spreadsheet.


Normally the test is run for a minimum of 8 hours.  This helps weed out caching scenarios and shows whether a system can sustain its load over long periods of time.  If you can only go fast for 5 minutes before suffering memory exhaustion, you’re not really all that reliable.


The 1.23ms latency was consistent regardless of IO load.  That is the very definition of a high-performing system: reliably consistent storage at every load.

Performance, though, is not the only metric we can pull from SPC results.  Arguably the more important metrics are the cost factors.  We can easily build something that’s super reliable and ultra-low latency with a ton of IO.  If the solution costs more than the GDP of most third-world countries it’s not very useful, unless you can afford to buy a third-world country.


What do we look for from a cost standpoint?

Well, cost.  It’s a required field, but take into account what cost is shown.  Some results post list price, others post flat discounts, and some less scrupulous ones show discount levels that would require you to buy four systems to actually get those rates.


Application Utilization (AU): how much of the system was actually used for the test.  Using only a small slice of the system used to be called short-stroking a drive; now it’s called filesystem aging or sprawling.  AU also takes the system’s RAID type into account, since drives consumed by protection are provided to the system but do not contribute to its effective capacity.  Most results are posted on RAID 10 because, for the longest time, it had the best write performance of the standard RAID levels.  It also means 50% of your drives are gone off the top, so the best most vendors can hope to achieve is an AU of 50%, and that’s before spares.
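As a rough sketch of how that ceiling works (the capacities and spare count here are hypothetical, purely to show the limit RAID 10 imposes):

```python
# Application Utilization (AU) = capacity the benchmark actually used
# divided by the raw capacity purchased.  Hypothetical numbers below.

raw_tb = 100.0                     # physical capacity purchased (hypothetical)
raid10_usable_tb = raw_tb * 0.50   # mirroring discards half off the top
spares_tb = 4.0                    # hypothetical hot spares

best_case_au = (raid10_usable_tb - spares_tb) / raw_tb
print(f"Best-case AU on RAID 10: {best_case_au:.0%}")   # 46% -- can never exceed 50%
```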


NetApp FAS SPC1

Current #1 SPC1 result: Huawei OceanStor

So what’s with that Protected Application Utilization number?

Some will argue this metric is more important than the Application Utilization number.  Those people tend to have RAID 10 systems, where half of the raw capacity is chucked out the window.  Protected Application Utilization measures how much of the end usable space (post formatting and protection) the benchmark consumed.  The problem with using that as a metric is that one could create a mirroring scheme five copies deep and then fill that small protected piece to the brim.  If the system started with 100 drives you would in essence only be using 20 of them; sure, it’s 100% space efficient inside the protected area, but it’s still 80% waste compared to what you purchased.
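That 5x mirroring thought experiment works out like this (a sketch using the post’s own 100-drive example):

```python
# Protected Application Utilization vs Application Utilization for the
# exaggerated 5-way mirror example above (100 drives purchased).

drives_purchased = 100
mirror_copies = 5
drives_of_unique_data = drives_purchased // mirror_copies   # 20 drives

protected_au = 1.00                                  # the protected slice is filled to the brim
real_au = drives_of_unique_data / drives_purchased   # what you used vs what you bought

print(f"Protected Application Utilization: {protected_au:.0%}")  # 100% -- looks great
print(f"Application Utilization:           {real_au:.0%}")       # 20% -- 80% of the purchase is waste
```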


Summary

Is NetApp FAS a high performance platform?

The data speaks for itself: 685K real-world IOPS @ 1.23ms consistent latency.  But it’s not sitting in the very top ranks, so it must not be a “high performance tier1 system,” right?

YES IT IS!


Now is it perfect for every application?

No.  Nothing is ever a silver bullet for all problems.  That’s why it’s silver.


The issue that crops up when performance requirements are being thrown around should not be how many MB your interfaces can do or how much IO you can shovel.  The question to answer is: at what latency?

Does your application need 40Gb of throughput at consistently low latency because it’s a sprawling general-purpose virtualization farm?

FAS can do it.

Does your application need 400Gb of throughput at consistently low latency because it’s a streaming repository for 750 live HD feeds?

FAS could do it, but ESeries would be better suited.


Cheers!
