`

Tez: 1 Apache Tez: A New Chapter in Hadoop Data Processing

    博客分类:
  • Tez
 
阅读更多

What is Apache Tez?

Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed acyclic graph) of tasks. It also represents the next logical next step for Hadoop 2 and the introduction of with YARN and its more general-purpose resource management framework.

While MapReduce has served masterfully as the data processing backbone for Hadoop, its batch-oriented nature makes it unsuited for certain workloads like interactive query. Tez represents an alternate to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale. A great example of a benefactor of this new approach is Apache Hive and the work being done in theStinger Initiative

 

Motivation

Distributed data processing is the core application that Apache Hadoop is built around. Storing and analyzing large volumes and variety of data efficiently has been the cornerstone use case that has driven large scale adoption of Hadoop, and has resulted in creating enormous value for the Hadoop adopters. Over the years, while building and running data processing applications based on MapReduce, we have understood a lot about the strengths and weaknesses of this framework and how we would like to evolve the Hadoop data processing framework to meet the evolving needs of Hadoop users. As the Hadoop compute platform moves into its next phase with YARN, it has decoupled itself from MapReduce being the only application, and opened the opportunity to create a new data processing framework to meet the new challenges. Apache Tez aspires to live up to these lofty goals.

Key Design Themes

Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and then execute it with high performance. Apache Tez has been built around the following main design themes that solve these key challenges in the Hadoop data processing domain.

Ability to express, model and execute data processing logic

tez1Tez models data processing as a dataflow graph with vertices in the graph representing application logic and edges representing movement of data. A rich dataflow definition API allows users to express complex query logic in an intuitive manner and it is a natural fit for query plans produced by higher-level declarative applications like Hive and Pig. As an example, the diagram shows how to model an ordered distributed sort using range partitioning. The Preprocessor stage sends samples to a Sampler that calculates sorted data ranges for each data partition such that the work is uniformly distributed. The ranges are sent to Partition and Aggregate stages that read their assigned ranges and perform the data scatter-gather. This dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.

Flexible Input-Processor-Output task model

tez2Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. Input & Output determine the data format and how and where it is read/written. Processor holds the data transformation logic. Tez does not impose any data format and only requires that a combination of Input, Processor and Output must be compatible with each other with respect to their formats when they are composed to instantiate a vertex task. Similarly, an Input and Output pair connecting two tasks should be compatible with each other. In the diagram, we can see how composing different Inputs, Outputs and Processors can produce different tasks.

Performance via Dynamic Graph Reconfiguration

tez3Distributed data processing is dynamic by nature and it is extremely difficult to statically determine optimal concurrency and data movement methods a priori. More information is available during runtime, like data samples and sizes, which may help optimize the execution plan further. We also recognize that Tez by itself cannot always have the smarts to perform these dynamic optimizations. The design of Tez includes support for pluggable vertex management modules to collect relevant information from tasks and change the dataflow graph at runtime to optimize for performance and resource usage. The diagram shows how Tez can determine an appropriate number of reducers in a MapReduce like job by observing the actual data output produced and the desired load per reduce task.

Performance via Optimal Resource Management

tez4Resources acquisition in a distributed multi-tenant environment is based on cluster capacity, load and other quotas enforced by the resource management framework like YARN. Thus resource available to the user may vary over time and over different executions of the job. It becomes paramount to be able to efficiently use all available resources to run a job as fast as possible during one instance of execution and predictably over different instances of execution. The Tez execution engine framework allows for efficient acquisition of resources from YARN along with extensive reuse of every component in the pipeline such that no operation is duplicated unnecessarily. These efficiencies are exposed to user logic, where possible, such that users may also leverage this for efficient caching and avoid work duplication. The diagram shows how Tez runs multiple containers within the same YARN container host and how users can leverage that to store their own objects that may be shared across tasks.

We hope this brief overview about the philosophy and design of Apache Tez will throw some light on the aspirations of the project and how we hope to work with the Apache Hadoop community to bring them to life. Apache Hive and Apache Pig projects have already show deep interest in integrating with Tez.

 

 

 

 

 

 

http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/

分享到:
评论

相关推荐

    tez:Apache Tez

    Apache Tez是一个通用的数据处理管道引擎,被设想为用于更高抽象的低级引擎,例如Apache Hadoop Map-Reduce,Apache Pig,Apache Hive等。 从本质上讲,tez非常简单,只有两个组成部分: 数据处理流水线引擎可以...

    Bikas Saha:Apache Tez-A框架模型和构建Hadoop数据处理应用程序

    该文档来自于Apache Hadoop和Tez项目PMC成员Bikas Saha,在2014中国大数据技术大会大数据技术分论坛的演讲“Apache Tez-A Framework to Model and Build Hadoop Data Processing Applications”。

    docker-hive-on-tez:在 Tez 上运行的 Apache Hive 的 Docker 镜像

    当前版本Apache Hive(主干版) Apache Tez 0.5.2 Apache Hadoop 2.5.2 PostgreSQL 9.3(Hive 元存储后端)在 Mac OS X 上运行此步骤仅适用于 Mac OS X,因为 Mac OS X 本身不支持 docker。要在 Mac OS X 上运行 ...

    源码apache-tez-0.8.3编译后的hadoop2.8.3版本hive-tez包tez-0.8.3.tar.gz

    源码使用的是apache-tez-0.8.3,对应的hadoop版本2.8.3,源码包中的nodejs的版本是v0.12.3,很难编译通过,最后把nodejs改成了v4.0.0才编译通过tez-ui2模块。

    源码apache-tez-0.8.3编译后的hadoop2.7.3版本hive-tez包tez-0.8.3.tar.gz

    源码使用的是apache-tez-0.8.3,对应的hadoop版本2.7.3,源码包中的nodejs的版本是v0.12.3,很难编译通过,最后把nodejs改成了v4.0.0才编译通过tez-ui2模块。

    Practical.Hive.A.Guide.to.Hadoops.Data.Warehouse.System.1484202724

    Chapter 1: Setting the Stage for Hive: Hadoop Chapter 2: Introducing Hive Chapter 3: Hive Architecture Chapter 4: Hive Tables DDL Chapter 5: Data Manipulation Language (DML) Chapter 6: Loading Data ...

    Programming Pig: Dataflow Scripting with Hadoop [2016]

    Use Pig with Apache Tez to build high-performance batch and interactive data processing applications Create your own load and store functions to handle data formats and storage mechanisms

    TEZ:训练pytorch模型更快rrrr......。-Python开发

    tez:训练pytorch模型fastrrrr ....... tez:训练pytorch模型fastrrrr .......注意:当前,我们不接受任何拉取请求! 所有公共关系将被关闭。 如果您需要某个功能或某些功能不起作用,请创建一个问题。 意思是“锐利...

    apache-tez-0.9.2-bin.tar.gz

    apache tez 安装

    apache-tez-0.9.0-bin.tar.gz

    Tez是Apache开源的支持DAG作业的计算框架,它直接源于MapReduce框架,核心思想是将Map和Reduce两个操作进一步拆分,即Map被拆分成Input、Processor、Sort、Merge和Output, Reduce被拆分成Input、Shuffle、Sort、...

    apache-tez-0.8.3-src.tar.gz

    Tez是Apache开源的支持DAG作业的计算框架,它直接源于MapReduce框架,核心思想是将Map和Reduce两个操作进一步拆分,即Map被拆分成Input、Processor、Sort、Merge和Output, Reduce被拆分成Input、Shuffle、Sort、...

    Apache TEZ部署手册

    Apache TEZ 部署手册 的各个步骤,包括打包等步骤说明

    tez about hadoop-2.7.1

    tez的关于hadoop-2.7.1编译好的包

    apache-tez-0.9.2-src.tar.gz

    Tez是Apache最新的支持DAG作业的开源计算框架,它可以将多个有依赖的作业转换为一个作业从而大幅提升DAG作业的性能。Tez并不直接面向最终用户——事实上它允许开发者为最终用户构建性能更快、扩展性更好的应用程序。...

    apache-tez-0.10.1-src.tar.gz

    Tez是Apache最新的支持DAG作业的开源计算框架,它可以将多个有依赖的作业转换为一个作业从而大幅提升DAG作业的性能。Tez并不直接面向最终用户——事实上它允许开发者为最终用户构建性能更快、扩展性更好的应用程序。...

    tez:Tez是用于PyTorch的超级简单且轻巧的Trainer。 它还带有许多实用程序,可用于解决PyTorch中90%以上的深度学习项目

    Tez:简单的pytorch培训师 注意:当前,我们不接受任何拉取请求! 所有公共关系将被关闭。 如果您需要某个功能或某些功能不起作用,请创建一个问题。 意思是“锐利,快速,活跃”。 这是一个简单的要点库,使您的...

    apache-tez-0.10.2-src.tar.gz

    Tez是Apache最新的支持DAG作业的开源计算框架,它可以将多个有依赖的作业转换为一个作业从而大幅提升DAG作业的性能。Tez并不直接面向最终用户——事实上它允许开发者为最终用户构建性能更快、扩展性更好的应用程序。...

    apache-tez-0.10.2-bin.tar.gz

    Tez是Apache最新的支持DAG作业的开源计算框架,它可以将多个有依赖的作业转换为一个作业从而大幅提升DAG作业的性能。Tez并不直接面向最终用户——事实上它允许开发者为最终用户构建性能更快、扩展性更好的应用程序。...

    tez-0.8.5-hadoop-2.6.5-bin.zip

    tez-0.8.5-hadoop-2.6.5-bin,java1.8。hadoop2.6.5,jdk1.8,tez-0.8.5的编译包。

    为Apache Hadoop、Spark以及Tez等大数据计算框架集成.zip

    hadoop-cos(CosN文件系统)为Apache Hadoop、Spark以及Tez等大数据计算框架集成提供支持,可以像访问HDFS一样读写存储在腾讯云COS上的数据。同时也支持作为Druid等查询与分析引擎的Deep Storage. 各领域数据集,...

Global site tag (gtag.js) - Google Analytics