Sunday, July 26, 2015

Big Data or Data Warehouse

Is Big Data a disruptive technology that will sooner or later replace Business Intelligence (BI) and conventional data warehouses (DWH), or is Big Data completely unsuitable for conventional BI and has no place in this environment?

On April 16, 2014, William Harvey Inmon wrote a remarkable blog entry. Under the headline "Big Data or Data Warehouse? Turbocharge your Porsche - buy an Elephant", the father of data warehousing took issue with the Cloudera advertising slogan "BIG DATA - Turbocharge your data warehouse", essentially the blending of an architecture (data warehouse) with a technology (Big Data). In particular, he emphasized that the carefully constructed and operated information infrastructure that is strictly necessary for data warehouses simply cannot be provided with the Apache Hadoop platform.

Any manager who uses Big Data technologies for Sarbanes-Oxley or Basel II reporting, he argued, would not keep his or her job for long. Those are strong words for a clear delineation of the two worlds. And Inmon is not just anyone: in 2007, Computerworld counted him among the ten most important IT personalities of the past 40 years.

At almost exactly the same time, Ralph Kimball, next to Inmon certainly the most influential data warehouse protagonist, came out in favor of using Hadoop as a data warehouse platform, precisely in the context of the aforementioned Cloudera offensive. He praised the flexibility, performance and cost savings of future Hadoop data warehouses and predicted great potential for this approach.

Inmon and Kimball already represented very different approaches in the nineties, such as top-down versus bottom-up or normal-form versus dimensional modeling, yet in practice these approaches fortunately complemented each other quite usefully and have thus strongly influenced most of today's data warehouses. The new conflict between Kimball and Inmon is equally characteristic of the current debate around data warehouses and Big Data. And it is to be hoped that it will have just as positive an effect on future solutions in this environment as previous clashes did.


In order to understand what led to these apparently diametrically opposed opinions on a technical question, a look at conventional data warehouses is necessary first.

The classic data warehouse
Data warehouses are usually operated on a relational database. The data stored there is read and processed with SQL. For the preparation of data for users, in the so-called data marts, specialized multidimensional OLAP databases are partly in use. Both technologies are ideally suited for many typical data warehouse applications, for example for business management reporting and controlling.
Relational databases provide the option of joining data sets with each other and thus the flexibility for ad hoc queries, even with completely new requirements. In addition, the query language SQL is easy to learn and is directly supported by every established BI and reporting tool. Relational databases also provide data consistency, in principle high availability, and a mostly acceptable processing speed for large, structured data sets.
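To make the idea of an ad hoc join concrete, here is a minimal sketch in Python that uses SQLite purely as a stand-in for a production RDBMS; the customers and orders tables and their contents are hypothetical.

import sqlite3

# Minimal sketch of an ad hoc relational join (hypothetical tables and data,
# SQLite only stands in for a production RDBMS).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'FR');
    INSERT INTO orders    VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 150.0);
""")

# The join needs no pre-planned access path: any two tables sharing a key can
# be combined on the fly, which is what makes ad hoc queries so flexible.
for row in con.execute("""
        SELECT c.country, SUM(o.amount) AS revenue
        FROM   orders o
        JOIN   customers c ON c.customer_id = o.customer_id
        GROUP  BY c.country
        ORDER  BY revenue DESC"""):
    print(row)   # e.g. ('DE', 200.0), ('FR', 150.0)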

Many users work very interactively with data: they filter, drill down and roll up hierarchy levels, just as one knows it from the familiar pivot tables in Excel, for example. But if this is done with SQL on very large amounts of data and many tables are joined in the process, delays in the two- and three-digit second range are quickly to be expected.

This is where multidimensional OLAP databases traditionally come into play. Their structure keeps the data very close to the business perspective and thus maps detailed, customer-specific business models. They distinguish measures from dimensions, form hierarchies from master data and, in particular, hold information pre-calculated across many aggregation levels, ready for queries optimized for high speed. For example, revenue per country and product group can be accessed directly, without deriving these figures from the single order items again for every query.
If no data current to the second or minute is required, a daily calculation, for example, is far more efficient than the permanent recalculation of all sums for every individual query. This holds regardless of the technology used. Many users from the business departments get along with this very well. The triumph of RDBMS and MOLAP over the past 20 years was anything but accidental.
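The effect of such pre-aggregation can be illustrated with a small sketch; the following uses pandas and hypothetical column names to mimic what a nightly MOLAP build does, namely computing the aggregate once so that interactive queries only look up finished figures.

import pandas as pd

# Hypothetical order items at the finest granularity (single order lines).
items = pd.DataFrame({
    "country":       ["DE", "DE", "FR", "FR", "FR"],
    "product_group": ["A",  "B",  "A",  "A",  "B"],
    "revenue":       [100,  50,   70,   30,   40],
})

# A daily batch job would pre-compute this aggregate once ...
cube = items.groupby(["country", "product_group"], as_index=False)["revenue"].sum()

# ... so an interactive query only looks up the finished figure instead of
# summing all order lines again on every request.
def revenue(country: str, product_group: str) -> float:
    hit = cube[(cube.country == country) & (cube.product_group == product_group)]
    return float(hit.revenue.iloc[0]) if not hit.empty else 0.0

print(revenue("FR", "A"))   # 100.0, answered from the pre-aggregated table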

Both technologies present data in a unified and stable model, which gives users an understandable and reliable database for their business questions at all times and at the same time supports this business focus through the model itself. The occasionally invoked dissatisfaction of BI users is less attributable to the technology used and mostly to inadequate modeling, neglected communication, missing technical documentation or insufficient data quality.

In short, the use of relational databases and MOLAP data marts, the most common setup in today's data warehouse applications, is obviously a good choice. With the introduction of in-memory techniques in relational databases over the last two to three years, the possibilities in terms of timeliness and query performance have improved even more significantly.

Big Data: new job profiles

In the partly euphoric assessments of market researchers and IT companies, there is repeated talk of new job descriptions that Big Data will bring with it. These include the following activities:

Data Scientist

They determine which forms of analysis are best suited to achieve the desired insights and which raw data is necessary for this. Such professionals need sound knowledge in areas such as statistics and mathematics. In addition, they need knowledge about the industry in which a company operates and about IT technologies such as databases, network technologies, programming and business intelligence applications. Likewise in demand are negotiating skills and emotional competence when it comes to cooperating with other departments.

Data Artist or Data Visualizer
They are the artists among the Big Data experts. Their main task is to present the evaluations in such a way that they are understandable to business managers. For this purpose, these experts translate data into graphs and charts.

Data Architect
They create data models and determine when and which analysis tools are used and which data sources are to be tapped. They also need comprehensive expertise in areas such as databases, data analysis and business intelligence.

Data Engineer
This role is strongly aligned with the IT infrastructure. The data engineer is in charge of the Big Data analysis system, i.e. the hardware, software and network components that are needed for the collection and analysis of data. A similar function is held by system and network managers in the IT sector.

Information Broker
They can play multiple roles, such as a data distributor who supplies customers with information, or an in-house expert who procures data sets from different sources inside and outside the company. It is also their job to develop ideas on how this information can be put to good use.

Data Change Agents
These professionals have a more political function. They are supposed to analyze and adapt existing processes in the company so that they are compatible with Big Data initiatives. Only then can the maximum benefit be drawn from such projects. Important are therefore strong communication skills, an understanding of business processes as well as knowledge of quality assurance and quality management (Six Sigma, ISO 9000).

The limits of classical data warehouses
Despite all of this, conventional data warehouse solutions have numerous limitations. RDBMS and MOLAP databases struggle when it comes to unstructured data such as documents, movies and audio files, or to data with a frequently changing or not pre-defined structure. The information in question is of course structured in some way, but not in a form that allows a conventional relational analysis of the stored data.
The latter also includes data that carries its structure implicitly, such as XML. Indeed, Bill Inmon himself has emphasized the importance of unstructured data in data warehouses in his DW 2.0 description. For some of these requirements, meaningful RDBMS extensions have existed for some time.

Conventional relational or MOLAP databases are also not optimized to deal with tens of thousands or even millions of individual transactions per second, as they may occur, for instance, in the timely processing of engineering, sensor or social media data. The problem is not throughput as such, because even on a simple PC thousands of records per second can be loaded into an RDBMS and joined with others.
What is difficult is rather processing each record individually, that is, streaming the data in near real time. This separation consumes additional resources, and the consistency rules inherent in relational databases create strong bottlenecks. But here, too, the first vendors are slowly opening up to new approaches, such as Microsoft SQL Server 2014 with its in-memory OLTP features.
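The difference between bulk loading and record-by-record (streamed) processing can be made tangible with a small experiment. The following sketch uses SQLite purely as an illustration; the absolute timings depend entirely on the machine and say nothing about any particular product.

import sqlite3, time

rows = [(i, f"sensor-{i % 100}", i * 0.1) for i in range(20_000)]

def load(batched: bool) -> float:
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE readings (id INTEGER, sensor TEXT, value REAL)")
    start = time.perf_counter()
    if batched:
        # Bulk load: one transaction for all rows, the classic DWH load pattern.
        with con:
            con.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    else:
        # Streamed: every record is its own transaction, as in event-by-event
        # processing; the per-transaction overhead dominates.
        for r in rows:
            with con:
                con.execute("INSERT INTO readings VALUES (?, ?, ?)", r)
    return time.perf_counter() - start

print(f"bulk    : {load(True):.3f}s")
print(f"per-row : {load(False):.3f}s")   # noticeably slower; on a disk-backed
                                         # database the gap is far larger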

If huge amounts of data have to be evaluated, higher response times for user queries are to be expected on popular relational databases. Where large amounts of data at the finest granularity have to be processed, such as call data records at telephone service providers, good response times are difficult to achieve with MOLAP solutions. Moreover, it happens that the required timeliness of queries even rules out pre-aggregation, or that urgent requirements have to be implemented ad hoc on large data stocks. To reduce the resulting response times from hours to minutes or seconds, other technical approaches are necessary.

The sheer amount of data is rarely an insurmountable technical limit for relational database systems or large database clusters. In the high terabyte or petabyte range, however, it does lead to impairments: standard administration procedures such as backups or complex migrations of data structures can no longer be carried out in the usual way without noticeably limiting operability.

Finally, the licensing, support and hardware costs of modern, powerful RDBMS or MOLAP software play a substantial role when it is used for a data warehouse. Here, options for cost savings are becoming increasingly attractive, especially for extremely large amounts of data.

When large amounts of unstructured or inconsistently structured data and high individual transaction rates have to be processed while short latency and response times for ad hoc queries are demanded at the same time, conventional DWH solutions are pushed to their limits. The ongoing IT costs can also be reduced only to a limited extent by the mere use of data warehouses. In order to meet these requirements in total, new technologies come into play.

License management in database environments
Modern database solutions are now deeply integrated into virtualization, high availability and cloud environments from different manufacturers. A cyclical analysis of the license inventory provides valuable information on how the demand has evolved over a prolonged period.

The necessary steps, using the example of an Oracle license analysis, are as follows (a simplified comparison sketch follows the list):

1. Determination of the hardware characteristics from the configuration management.

2. Determination of the software characteristics based on an analysis tool set.

3. Determination of the license characteristics based on the customer's Oracle contract documents, taking into account the information from the Oracle installed base, the OLSA (Oracle License and Service Agreement), the support agreements and other license-related documents such as the SIG (Software Investment Guide).

4. Comparison of the current situation (hardware and software characteristics) with the target state (license identification data).

5. Evaluation of the consequences for the planning.

6. Final report with reviews, recommendations, offer comparisons and possible measures.
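The comparison in step 4 boils down to a very simple pattern. The following sketch is purely illustrative; all product names and license counts are hypothetical, and a real analysis of course also involves processor metrics, core factors and contract clauses.

# Purely illustrative sketch of step 4: comparing the determined current state
# (deployed processor licenses per product) with the contractually licensed
# quantities. All product names and figures are hypothetical.
deployed = {"Enterprise Edition": 16, "Partitioning": 16, "Diagnostics Pack": 8}
licensed = {"Enterprise Edition": 12, "Partitioning": 16, "Diagnostics Pack": 12}

for product in sorted(set(deployed) | set(licensed)):
    gap = deployed.get(product, 0) - licensed.get(product, 0)
    if gap > 0:
        status = f"under-licensed by {gap}"
    elif gap < 0:
        status = f"{-gap} licenses unused"
    else:
        status = "compliant"
    print(f"{product:20s} deployed={deployed.get(product, 0):3d} "
          f"licensed={licensed.get(product, 0):3d}  -> {status}")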

Unlimited freedom

NoSQL databases offer this freedom both in data structure and in data volume. Many of these systems focus on data without fixed, predefined columns, so new data structures can be added during operation. In addition, they enable low latencies when processing high transaction volumes. This makes them ideal for web applications with tens or hundreds of thousands of concurrent users as well as for the timely processing of large data streams, as they arise, for example, in social media platforms, technical production or the Internet of Things (IoT).
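A brief sketch illustrates this schema freedom; it assumes the pymongo driver and a MongoDB instance on localhost, both of which merely stand in for any document-oriented NoSQL store.

from pymongo import MongoClient

# Minimal sketch, assuming a MongoDB instance is reachable on localhost.
# Documents in one collection may carry completely different fields, so new
# data structures can be added during operation without a schema migration.
events = MongoClient("mongodb://localhost:27017/")["demo"]["events"]

events.insert_one({"type": "click", "user": "u1", "page": "/home"})
events.insert_one({"type": "sensor", "device": "d42", "temp_c": 21.7,
                   "geo": {"lat": 48.1, "lon": 11.6}})   # new, nested fields

for doc in events.find({"type": "sensor"}):
    print(doc)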

The rapid processing of huge amounts of arbitrarily structured individual files in batch processes, on the other hand, is the domain of Apache Hadoop, a framework for executing distributed tasks across different computer systems. The combination of the many software products that dock onto it is the so-called Hadoop ecosystem. By now there are numerous, partly commercial distributions, among others the one from the IT specialist Cloudera cited above. Besides Hadoop itself, such distributions include NoSQL databases, orchestration, scripting, programming, analysis and data integration tools and much more.
... or its limits

The fact that the individual components are often developed independently of each other, however, does not only have advantages. The flexibility of the solutions is indeed high and there is a wide range of specialized tools for every application, but their interplay is currently by no means seamless. In addition, many tools are still at a relatively early stage of development. The consolidation required for high stability and maturity is therefore far from complete.

To understand why this Big Data technology is nevertheless traded so highly in the business intelligence environment, we have to take a look at its specific advantages and disadvantages.

Apache Hadoop is a framework for running programs on many simple, autonomous machines in a network that work on common tasks in a distributed fashion. The main objective is practically unlimited, linear scalability over low-cost computing nodes, often based on simple PC technology. Hadoop is built on a special file system called HDFS (Hadoop Distributed File System). Files are stored distributed across the nodes, so that only part of a file resides on any single compute node; for safety, all data is in fact stored on multiple computers. Programs that process this arbitrarily structured data do so according to special programming models for distributed processing.
Such models, such as MapReduce, dictate how data is most effectively scanned, interpreted, aggregated and merged in a Hadoop cluster.
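The classic word count example shows the MapReduce model in miniature. The following self-contained Python sketch simulates the shuffle step locally; on a real cluster the map and reduce functions would be distributed across many nodes, for instance via Hadoop Streaming.

from itertools import groupby
from operator import itemgetter

def map_phase(line: str):
    """Emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word: str, counts):
    """Sum all counts that the shuffle grouped under the same key."""
    return word, sum(counts)

lines = ["big data or data warehouse", "data warehouse or big data"]

# Map, then shuffle/sort by key, then reduce.
mapped = [pair for line in lines for pair in map_phase(line)]
mapped.sort(key=itemgetter(0))
reduced = [reduce_phase(k, (c for _, c in group))
           for k, group in groupby(mapped, key=itemgetter(0))]

print(reduced)   # [('big', 2), ('data', 4), ('or', 2), ('warehouse', 2)]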

Rapid development
In addition to all the benefits of the new tools, much of what has been taken for granted in relational databases for decades is unfortunately still in its infancy. For example, while in an RDBMS optimal execution plans for processing SQL commands are calculated and executed automatically by so-called optimizer components using numerous data statistics, with the new technologies the compilation of the optimal processing steps is today still largely left to the programmer. Some tools, such as the SQL-to-MapReduce converter Hive or the scripting engine Pig, already generate optimized Hadoop programs, but they are not nearly as efficient as a conventional database optimizer. Moreover, these programs usually run for minutes or hours and are not intended for interactive work.

But the next generation of Big Data programs is already waiting. New tools such as Apache Spark work directly with HDFS data and optimize both throughput and response times through more efficient process management, special indexing or caching mechanisms.
In-memory solutions such as Impala are developing into serious alternatives for interactive applications in the BI environment, and tools for cross-platform analysis such as Presto or commercial options such as DataVirtuality enable the integration of the various approaches into a common, SQL-compatible view.
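As a rough sketch of how such a tool works directly against HDFS data, the following PySpark snippet aggregates hypothetical call detail records; the path, the field positions and the column meanings are illustrative assumptions, not a reference implementation.

from pyspark import SparkContext

# Hedged sketch: assumes a PySpark installation and hypothetical CSV files with
# call detail records on HDFS (path and field positions are illustrative).
sc = SparkContext(appName="cdr-aggregation")

# Spark reads the HDFS blocks in parallel on the nodes that hold them and keeps
# intermediate results in memory, which is what shortens the response times.
cdr = sc.textFile("hdfs:///data/cdr/2015/07/*.csv")

minutes_per_country = (cdr.map(lambda line: line.split(","))
                          .map(lambda f: (f[2], float(f[5])))   # (country, minutes)
                          .reduceByKey(lambda a, b: a + b))

for country, minutes in minutes_per_country.takeOrdered(10, key=lambda kv: -kv[1]):
    print(country, minutes)

sc.stop()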

The right choice


Obviously, there is no complete solution in the world of Big Data technologies with which database architects can overcome all the limits of the relational or multidimensional worlds easily and seamlessly. For each problem area, certain tools and modeling variants are optimal. The following examples show possible solutions with these technologies, together with the disadvantages their use brings compared to traditional RDBMS and MOLAP.

Unstructured data such as documents, or implicitly structured data such as XML, can be processed with practically all programming and scripting languages. Schemas defined via central metadata also allow multiple, different views of any kind of data, because the access mechanisms and the transformation into another presentation can be implemented freely.

Disadvantage: apart from a few predefined formats, the access mechanisms unfortunately have to be implemented manually today. The number of new, generally available formats is, however, rising steadily.
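A minimal example of such a hand-written access mechanism: the following sketch turns implicitly structured XML (a made-up order document) into flat records that a relational engine or BI tool could consume.

import xml.etree.ElementTree as ET

# Minimal sketch of a manually implemented access mechanism: implicitly
# structured XML (hypothetical order documents) is transformed into flat rows.
xml_doc = """
<orders>
  <order id="10" customer="C1">
    <item sku="A" qty="2" price="9.90"/>
    <item sku="B" qty="1" price="4.50"/>
  </order>
  <order id="11" customer="C2">
    <item sku="A" qty="3" price="9.90"/>
  </order>
</orders>
"""

rows = []
for order in ET.fromstring(xml_doc).findall("order"):
    for item in order.findall("item"):
        rows.append({
            "order_id": int(order.get("id")),
            "customer": order.get("customer"),
            "sku":      item.get("sku"),
            "revenue":  int(item.get("qty")) * float(item.get("price")),
        })

for r in rows:
    print(r)   # flat records, ready for loading into a relational table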

For the timely processing of many individual transactions, various components are necessary, and the possibilities for technical implementation are manifold. A distributed messaging system such as Apache Kafka can be used, for example, to capture the transactions. By means of a distributed computing engine such as Apache Storm, the data is then preprocessed on many compute nodes, classified and posted into a NoSQL database such as HBase. The processing of an individual transaction can be carried out within milliseconds, and thanks to extreme scaling over many compute nodes even hundreds of thousands or millions of transactions per second can be handled.
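The ingestion side of such a pipeline might look roughly like the following sketch; it assumes the kafka-python package and a broker on localhost:9092, and a plain consumer stands in for the Storm topology that would classify the events and write them to HBase.

import json
from kafka import KafkaProducer, KafkaConsumer

# Hedged sketch of the ingestion side of such a pipeline (topic name, broker
# address and record fields are hypothetical).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Capture individual transactions as they arrive.
for i in range(5):
    producer.send("transactions", {"txn_id": i, "amount": 10.0 * i})
producer.flush()

# Downstream processing: read, classify, and hand over for storage.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=2000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for msg in consumer:
    txn = msg.value
    txn["category"] = "large" if txn["amount"] >= 30 else "small"
    print(txn)   # in the real pipeline: write to HBase or another NoSQL store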

Disadvantage: architecture and implementation are complex, and the data stored in the NoSQL databases cannot be linked arbitrarily; joins are often simply not available. In addition, data consistency cannot be guaranteed throughout at all times, and common BI tools are often unable to work together with NoSQL databases.

Even extremely large amounts of data, for example hundreds of petabytes, can be stored and processed in a Hadoop cluster. Backup and recovery or high availability are solved differently here than in relational databases, however, and the possibilities are subject to stronger restrictions. A fully consistent backup can practically never be created, nor are geographically widely distributed infrastructures, so-called stretch clusters, efficiently possible.

In terms of license and hardware costs, solutions around Hadoop are very cheap when one considers the investment costs per terabyte of user data. Since particularly widespread and cheap hardware is used, the annual expenses for storage and processing can come out well below 1,000 euros per terabyte of raw (uncompressed) data. When renting a cloud solution, these prices can even be broken down to days or hours.

There are also little or no software licensing costs, since most tools are available under an open source license. Considerations of the total cost of ownership (TCO), however, can lead to very different results once support, building up expertise, maintenance, administration, operating costs, backup and all the other usual cost blocks are included. The advantage then shrinks or, in the case of improper use, may even be much smaller than that of a commercial RDBMS with local or network-based storage. In order to make the right decision, it is important to consider the application in question.

Piktochart

Templates help in designing modern infographics. The pro section holds ready-made templates for typical business topics.
Simple, but effective
The editor of Piktochart works like a heavily stripped-down graphics program. With simple tools, the templates can be edited and customized.

Google Charts
Google Charts offers a gallery with all kinds of different chart types.

More for developers
The tool is extremely flexible, but clearly a case for developers.

iCharts
The sober interface of iCharts is designed in the style of Windows.

Data feed
The strength of iCharts is not just the interface or the appearance of the graphs it generates; it is also very easy to feed the service with data.

Easel.ly

Among the thousands of templates in Easel.ly, an already pre-designed chart can be found for almost any purpose.

Interface
With the convenient user interface, well-designed infographics and charts can be created in a short time.

Gallery
The Gallery of Easel.ly provides more than one million public infographics. Each of them can be loaded into the editor mode.

Infogr.am
The interface of Infogr.am is immediately grasped and extremely clear.
Large Selection
The chart types of Infogr.am leave nothing to be desired.

Data Upload
Infogr.am can be fed with uploaded data and produces stylish, editable charts.

Dipity
Dipity focuses entirely on one form of data visualization: the timeline.

Sharing

Companies can create their own timelines for topics, products, projects etc. and share them publicly or with invited colleagues.

ChartsBin

ChartsBin converts input records into the desired map format.

Share and comment
The finished maps can be easily shared and commented on.

Venngage
Venngage masters all major chart types. For some features, however, the premium upgrade is mandatory.

Rights Management

Publishing the graphics or parts of them (either publicly or privately) on social networks works with a single click from the editor.

Connecting Worlds


Apparently the new technologies will not replace the classic ones in the near future (if ever). Currently the two worlds complement each other and together expand the limits of what is feasible. Logically, tools that encourage this interaction are establishing themselves today: connectors and federation tools are the next big trend. Large RDBMS vendors such as Oracle and IBM as well as the open source community deliver suitable software for the efficient exchange of data and the execution of distributed queries almost on a quarterly basis.

Data integration tools already read and write data in both directions today. And the virtualization of data via federation engines integrates data sources of all stripes for comprehensive use. The latter sounds like the miracle cure for all problems, but the analysis of distributed data has a fundamental problem: when large amounts of data are linked across system boundaries, the network and media breaks stand in the way of the required performance. Not without reason do even extremely specialized database clusters place the highest demands on configuration, hardware and software in order to scale well. When heterogeneous data structures and platforms are virtually cobbled together, this problem naturally shows up far more clearly. Distributed architectures are useful and appropriate for certain types of questions, but they are no panacea.

But we still need solutions to keep response times low for virtually any ad hoc user query without pre-processing. This is where in-memory databases or the corresponding database extensions come into play. They have found their way into almost every major RDBMS over the past two to three years and are therefore already in use in many DWH environments, or are at least in the testing phase.
Conclusion

Sensor, Internet or log data, RFID, text and network analysis: storing and evaluating everything in real time and at low cost. All data collected over years from hundreds of sources, combined without restriction to predefined structures, yet self-explanatory and easy to use. Basically a utopia, yes, but one that comes a small step closer with the increasing integration of data warehouse and Big Data technologies.

Getting started is difficult because suitable business cases are not always visible at first glance through RDBMS glasses. And yet almost everyone wants to know via social media what is being said about them in public, which customers are the most interesting, how production plants or development processes can be optimized (predictive analytics, semantic web), or how information can be filtered out of unstructured data (natural language processing).

A simple entry point with high knowledge potential is to build a basic, possibly cloud-based infrastructure, especially for IT-savvy analysts, the oft-cited data scientists. The availability of a Big Data platform, consisting of a Hadoop distribution and a not overly restrictive tool selection and enriched with conventional and new data sources, allows quick insights through the prototypical development of new analytical methods. Ideas can thus be validated quickly and, if successful, handed over to regular development. As a side effect, valuable Big Data expertise arises, along with the right instinct for opportunities and risks.

The next step may be to relieve the data warehouse platform of costs while integrating unstructured data into the BI process. This archive function is being put to the test in numerous companies and is in some cases already used successfully.

With the Data Lake approach, all potentially relevant data is stored cost-effectively and centrally on a Big Data platform, in its original format and at the finest granularity. There it is available to all systems, including pre-processing, distribution mechanisms and access protection. This possible next step toward the future is among the more hotly debated topics today.
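The landing step of such a data lake can be pictured with a small sketch: raw events are stored unchanged and at the finest granularity, partitioned by source and day. The local filesystem stands in for HDFS here, and all names and paths are hypothetical.

import json
import pathlib
from datetime import date

# Minimal sketch of the "landing" step of a data lake: raw events are stored
# unchanged, one JSON record per line, partitioned by source and day.
# The local filesystem stands in for HDFS; all names are hypothetical.
LAKE_ROOT = pathlib.Path("/tmp/datalake")

def land(source, events, day):
    target_dir = LAKE_ROOT / source / f"dt={day.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.json"
    with target.open("a", encoding="utf-8") as out:
        for event in events:          # original format, finest granularity
            out.write(json.dumps(event) + "\n")
    return target

path = land("webshop_clicks",
            [{"user": "u1", "page": "/home"}, {"user": "u2", "page": "/cart"}],
            date(2015, 7, 26))
print("landed raw events in", path)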

Of the central information hub that virtualizes all information across systems, delivers it with high performance and provides it in processed form, regardless of location and suitable for any purpose, business users will probably have to keep dreaming for a while yet.
