Data Lake Hadoop

The high cost and rigidity of the traditional enterprise data warehouse have fueled interest in a new type of business analytics environment: the data lake. A data lake is a large, diverse reservoir of enterprise data stored across a cluster of commodity servers running software such as the open source Hadoop platform for distributed big data analytics. A data lake Hadoop environment has the appeal of costing far less than a conventional data warehouse while being far more flexible in the types of data it can process and the variety of analytics applications it can support. To maximize these benefits, organizations need to carefully plan, implement, and manage their data lake Hadoop systems.

Moving Data into a Data Lake Hadoop Environment

One of the primary attractions of a data lake Hadoop system is its ability to store many data types with little or no pre-processing (a schema-on-read approach, sketched in the example after the list below). But with this source-data agnosticism can come a couple of "gotchas" that businesses need to be aware of when planning a data lake Hadoop deployment:

  • Without careful planning and attention to the big picture, it's easy to end up with a hodge-podge of multiple data movement tools and scripts, each specific to a different source data system—making the data migration apparatus as a whole difficult to maintain, monitor, and scale.
  • Each separate data flow may require coding work by programmers with expertise in both the source system's interfaces and the Hadoop interface. This need for developer hours can become a bottleneck when launching a new data lake Hadoop system or changing an existing one once it is up and running.
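To make the schema-on-read idea concrete, here is a minimal PySpark sketch. The HDFS paths, file formats, and column names are hypothetical, but the pattern is the general one: files are landed in the lake unchanged, and structure is applied only when the data is read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Raw files were copied into the lake as-is; no schema was imposed when
# they were written. Structure is applied only now, at read time.
clicks = spark.read.json("hdfs:///datalake/raw/clickstream/")  # semi-structured JSON
orders = (spark.read
          .option("header", True)
          .csv("hdfs:///datalake/raw/orders/"))                # flat CSV exports

# Two differently shaped sources can now be combined for analysis,
# assuming both expose a customer_id column (hypothetical).
clicks.join(orders, "customer_id").groupBy("customer_id").count().show()
```

Because no schema is enforced on the way in, new source systems can start landing data immediately; the cost of interpretation is deferred to each analytics application.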

These data lake Hadoop problems can be avoided by using a purpose-built big data ingestion solution like Qlik Replicate®. Qlik Replicate is a unified platform for configuring, executing, and monitoring data migration flows from nearly any type of source system into any major Hadoop distribution, including support for cloud data transfer to Hadoop-as-a-service platforms like Amazon Elastic MapReduce. Qlik Replicate can also feed Kafka-to-Hadoop flows for real-time big data streaming. Best of all, with Qlik Replicate, data architects can create and execute big data migration flows without any manual coding, sharply reducing reliance on developers and boosting the agility of your data lake analytics program.
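For contrast, here is a rough sketch of what one such flow looks like when hand-coded: a Kafka-to-HDFS stream written with PySpark Structured Streaming. The broker address, topic name, and paths are placeholders, and this is not how Qlik Replicate works internally; it simply illustrates the per-source plumbing, multiplied across every feed, that a configuration-driven tool is designed to replace.

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka connector package on the Spark classpath.
spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Subscribe to a Kafka topic as an unbounded streaming DataFrame.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "orders")                      # placeholder topic
          .load())

# Kafka keys and values arrive as binary; cast them to strings before landing.
records = stream.selectExpr("CAST(key AS STRING) AS key",
                            "CAST(value AS STRING) AS value")

# Continuously append raw records to HDFS as Parquet, with a checkpoint
# directory so the stream can recover its position after a failure.
query = (records.writeStream
         .format("parquet")
         .option("path", "hdfs:///datalake/raw/orders")
         .option("checkpointLocation", "hdfs:///datalake/checkpoints/orders")
         .start())

query.awaitTermination()
```

Every source system needs its own version of this script, with its own error handling, monitoring, and schema drift logic, which is exactly the maintenance burden the bulleted "gotchas" above describe.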
