How can companies optimise their Big Data application?

Published on
12/12/2019 01:51 PM

72% of enterprises say data analytics creates valuable insight, according to a recent SAS report. The report also found that 60% of firms are more innovative due to analytics resources.

Data analytics is maturing

Data is an inexhaustible resource, growing by 2.5 quintillion bytes every single day. As the amount of data generated is growing dramatically, the data analytics field is also maturing at a rapid rate.

A recent report from Dresner found that 80% of enterprises believe Big Data is important to their business intelligence strategy. In fact, from 2015 to 2018 Big Data adoption in the enterprise grew from 17% to an overwhelming 58%.

In order to process these massive volumes of data, however, IT organisations must implement a robust data management strategy. According to a whitepaper from Guavus, enterprises must "right-size their Hadoop clusters to balance the OPEX and CAPEX."

Achieving the right size of the Hadoop cluster

"Big data applications running on a Hadoop cluster can consume billions of records a day from multiple sensors or locations," Guavus notes. These applications process terabytes of data and can generate valuable insights either in real time or periodically.

However, real-time consumption "requires a more stringent query SLA and higher memory footprint." Companies should therefore consider peak data rates for real-time insights and hourly insights.

Companies often mistakenly treat the replication factor as a replacement for RAID. RAID protects data at the physical disk level, and for highly valuable data companies should use both replication factor and RAID.
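To make the effect of the replication factor on disk sizing concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from the Guavus whitepaper: the function name, the retention window and the free-space fraction are hypothetical, and RAID overhead is deliberately left out. The default of 3 matches the default HDFS replication factor.

```python
def raw_storage_tb(daily_ingest_tb, retention_days,
                   replication_factor=3, free_space_fraction=0.25):
    """Estimate raw disk capacity (in TB) for replicated data.

    All parameters are illustrative assumptions, not whitepaper figures.
    RAID overhead is not modelled here.
    """
    if not 0 <= free_space_fraction < 1:
        raise ValueError("free_space_fraction must be in [0, 1)")
    # Logical data kept on the cluster before any replication.
    logical_tb = daily_ingest_tb * retention_days
    # Every block is stored replication_factor times.
    replicated_tb = logical_tb * replication_factor
    # Keep part of each disk free for temporary and intermediate data.
    return replicated_tb / (1 - free_space_fraction)


# Example: 2 TB/day kept for 30 days with the default settings.
print(raw_storage_tb(2, 30))  # 240.0
```

In this example 60 TB of logical data becomes 180 TB once replicated three times, and 240 TB of raw disk once a quarter of the capacity is held back as free space.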

According to the whitepaper, enterprises should also consider that different stages of the process have different SLAs. Each stage therefore requires data cleanup, and an extra "write operation on disk" should be added when calculating disk IOPS.
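One way to read the extra write operation is as an additional write per stage for every unit of I/O work. The Python sketch below follows that reading. It is an assumption for illustration only: the function name, the stage list and the idea of counting operations per batch are hypothetical and do not come from the whitepaper.

```python
def disk_iops(batches_per_sec, stages, extra_writes_per_stage=1):
    """Estimate disk IOPS across pipeline stages.

    stages is a list of (reads_per_batch, writes_per_batch) tuples,
    one per stage. Each stage is charged extra_writes_per_stage
    additional writes per batch to cover data cleanup (an assumed
    interpretation of the extra write operation).
    """
    ops_per_batch = sum(
        reads + writes + extra_writes_per_stage
        for reads, writes in stages
    )
    return batches_per_sec * ops_per_batch


# Three hypothetical stages, each doing one read and one write per batch.
pipeline = [(1, 1), (1, 1), (1, 1)]
print(disk_iops(100, pipeline))                            # 900
print(disk_iops(100, pipeline, extra_writes_per_stage=0))  # 600
```

Under these assumptions the cleanup writes raise the disk load from 600 to 900 operations per second, which shows why leaving them out can undersize the cluster's disks.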

Infrastructure is also incredibly important. "For the same CPU, RAM and disk family, the performance is best on a physical deployment; it is about 20-30% lower on virtual machines or private clouds; and is about 60-70% lower on a public cloud."
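The quoted performance gaps can be turned into a rough node-count estimate. The Python sketch below assumes that throughput scales linearly with the number of nodes, which is a simplification and not a claim from the whitepaper; the function and dictionary names are hypothetical. Only the percentage ranges come from the quote above.

```python
# Performance loss relative to a physical deployment, in percent,
# as (low, high) ranges taken from the quoted figures.
PENALTY_PCT = {
    "physical": (0, 0),
    "virtual_or_private_cloud": (20, 30),
    "public_cloud": (60, 70),
}


def ceil_div(a, b):
    """Integer division rounded up, avoiding floating-point error."""
    return -(-a // b)


def nodes_needed(physical_nodes, deployment):
    """Return (best_case, worst_case) node counts for a deployment type.

    Assumes throughput scales linearly with node count, so a cluster
    running at (100 - p)% of physical speed needs 100 / (100 - p)
    times as many nodes.
    """
    low, high = PENALTY_PCT[deployment]
    return (
        ceil_div(physical_nodes * 100, 100 - low),
        ceil_div(physical_nodes * 100, 100 - high),
    )


# A workload sized at 10 physical nodes:
print(nodes_needed(10, "virtual_or_private_cloud"))  # (13, 15)
print(nodes_needed(10, "public_cloud"))              # (25, 34)
```

Under the linear-scaling assumption, a workload that fits on 10 physical nodes would need roughly 13 to 15 nodes on virtual machines or a private cloud, and 25 to 34 on a public cloud.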

Optimal sizing of the cluster is evidently critical for an application to continue to generate valuable insight. For more invaluable insights and recommendations regarding data analytics, take a look at the full whitepaper.
