
Tez 3: Runtime API in Apache Tez


Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing the processing of data and the edges representing the movement of data between those processing steps. The user logic that analyzes and modifies the data sits in the vertices. Edges determine the consumer of the data, how the data is transferred, and the dependency between the producer and consumer vertices.

For users of MapReduce (MR), the most basic capability Tez adds is the ability to run a chain of Reduce stages, as opposed to the single Reduce stage in the current MR implementation. Via the Task API, Tez can do this and much more: it can execute any form of processing logic without retrofitting it into a Map or Reduce task, and it supports multiple options for transferring data between vertices that are not restricted to the MapReduce shuffle transport mechanism.

The Building Blocks of Tez

The Task API provides the building blocks for a user to plug their logic for analyzing and modifying data into a vertex, and to augment that processing logic with the plugins necessary to transfer and route data between vertices.

Tez models the user logic running in each vertex as a composition of a set of Inputs, a Processor and a set of Outputs.

  • Input: An input represents a pipe through which a processor can accept input data from a data source such as HDFS or the output generated by another vertex.
  • Processor: The entity responsible for consuming one or more Inputs and producing one or more Outputs.
  • Output: An output represents a pipe through which a processor can generate output data for another vertex to consume or to a data sink such as HDFS.
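This composition can be pictured with a few stripped-down interfaces. The sketch below is only illustrative: the actual Tez runtime interfaces carry additional lifecycle, context and event-handling methods, and the simplified names here are assumptions made for readability.

import java.util.List;

// Illustrative sketch of the Input/Processor/Output composition described above.
// These are simplified stand-ins, not the actual Tez runtime interfaces.
interface Reader { }                          // e.g. a key-value or byte-stream reader
interface Writer { }                          // e.g. a key-value or byte-stream writer

interface Input {
    Reader getReader() throws Exception;      // pipe for consuming data from a source
}

interface Output {
    Writer getWriter() throws Exception;      // pipe for producing data for a consumer
}

interface Processor {
    // A vertex's user logic: consume a set of Inputs and produce a set of Outputs.
    void run(List<Input> inputs, List<Output> outputs) throws Exception;
}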

An edge in the DAG is a logical entity that represents a number of physical connections between the tasks of the two connected vertices. To make it easier for a developer to implement a new Processor, there are two kinds of Inputs and Outputs that either expose or hide this level of complexity (see the sketch after the list below):

  • Logical: A corresponding pair of a LogicalInput and a LogicalOutput represents the logical edge between two vertices. The implementation of the Logical objects hides all the underlying physical connections and exposes a single view of the data.
  • Physical: A pair of a physical Input and Output represents the connection between a task of the source vertex and a task of the destination vertex.
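To see how a Logical object can hide the physical connections, consider the hypothetical LogicalInput sketched below, which merges the data fetched over each physical connection into a single Reader. The class and method names are invented for illustration and are not part of the Tez API.

import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: one logical input fronting many physical connections,
// one per task of the source vertex. Names are invented for illustration.
class AggregatingLogicalInput {

    interface PhysicalConnection {
        Stream<byte[]> fetch();               // pull the output of one source task
    }

    private final List<PhysicalConnection> connections;

    AggregatingLogicalInput(List<PhysicalConnection> connections) {
        this.connections = connections;
    }

    // The Processor sees a single stream of data, regardless of how many
    // source tasks actually produced it.
    Iterator<byte[]> getReader() {
        return connections.stream()
                .flatMap(PhysicalConnection::fetch)
                .iterator();
    }
}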

An example of the Reduce stage within an MR job would be a Reduce Processor that receives data from the maps via ShuffleInput and generates output to HDFS. Likewise, an intermediate Reduce stage in an MRR chain would be quite similar to the final Reduce stage except for the difference in the Output type.

 

 

 

Tez Runtime API



To implement a new Input, Processor or Output, a user needs to implement the appropriate interfaces mentioned above. Each of these objects is given a Context object in its initialize function; this context is the hook through which the object communicates with the Tez framework. Inputs and Outputs are expected to provide implementations of their respective Readers and Writers, which the Processor then uses to read and write data. Within a task, after the Tez framework has initialized all the necessary Inputs, Outputs and the Processor, it invokes the Processor's run function and passes it handles to all the Inputs and Outputs for that particular task.
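Put together, the flow inside a task might look roughly like the sketch below. The key-value reader and writer shapes are assumptions chosen to mirror the description above; they are not the literal Tez signatures.

import java.util.List;

// Rough sketch of the task-side flow: the framework initializes the Inputs, the
// Outputs and the Processor, then hands the Processor its readers and writers.
// The reader/writer interfaces below are simplified assumptions.
class PassThroughProcessor {

    interface KeyValueReader {
        boolean next() throws Exception;      // advance to the next record, if any
        Object getCurrentKey();
        Object getCurrentValue();
    }

    interface KeyValueWriter {
        void write(Object key, Object value) throws Exception;
    }

    // Called by the framework after all Inputs and Outputs have been initialized.
    void run(List<KeyValueReader> inputs, List<KeyValueWriter> outputs) throws Exception {
        KeyValueReader in = inputs.get(0);
        KeyValueWriter out = outputs.get(0);
        while (in.next()) {
            // User logic goes here; this example simply forwards each record.
            out.write(in.getCurrentKey(), in.getCurrentValue());
        }
    }
}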

Tez allows all Inputs and Outputs to be pluggable. This requires support for passing information from the Output of a source vertex to the Input of the destination vertex. For example, let us assume that the Output of a source vertex writes all of its data to a key-value store. The Output would need to communicate the "key" to the Input of the next stage so that the Input can retrieve the correct data from the key-value store. To facilitate this, Tez uses Events.
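Before looking at Events in general, here is a minimal sketch of the payload that the Output in the key-value-store example above might hand to the downstream Input. The class is a simplified stand-in for an event carrying a user payload, not the actual Tez event type.

import java.nio.charset.StandardCharsets;

// Simplified stand-in for an event that carries a small user payload from an
// Output to the Input of the destination vertex (here: a lookup key for an
// external key-value store).
class StoreKeyEvent {

    private final byte[] userPayload;

    private StoreKeyEvent(byte[] userPayload) {
        this.userPayload = userPayload;
    }

    // Producer side: the Output announces where its data can be found.
    static StoreKeyEvent forKey(String storeKey) {
        return new StoreKeyEvent(storeKey.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: the destination Input recovers the key and fetches the data.
    String key() {
        return new String(userPayload, StandardCharsets.UTF_8);
    }
}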

 

Events in Tez

Events in Tez are a way to pass information amongst different components.

  • The Tez framework uses Events to pass information of system events such as task failures to the required components.
  • Inputs of a vertex can inform the framework of any failures encountered when trying to retrieve data from the source vertex’s Output that in turn can be used by the framework to take failure recovery measures.
  • An Output can pass information of the location of the data, which it generates, to the Inputs of the destination vertex.  An example of this is described in the Shuffle Event diagram which shows how the output of a Map stage informs the Shuffle Input of the Reduce stage of the location of its output via a Data Movement Event.
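For the shuffle case in the last bullet, the payload of such a Data Movement Event essentially tells the Shuffle Input where to fetch its partition of a particular map task's output. The field set below is an assumption modeled on how a MapReduce-style shuffle locates map output; it is not the exact Tez payload format.

// Minimal sketch of the information a shuffle-style data movement payload would
// need to carry; field names and encoding are illustrative assumptions.
class ShuffleDataLocation {
    final String host;           // node where the source task wrote its output
    final int port;              // port of the service that serves that output
    final String pathComponent;  // identifies the source task's output on that node
    final int partition;         // which partition of the output belongs to this consumer

    ShuffleDataLocation(String host, int port, String pathComponent, int partition) {
        this.host = host;
        this.port = port;
        this.pathComponent = pathComponent;
        this.partition = partition;
    }
}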



Another use of Events is to enable run-time changes to the DAG execution plan. For example, based on the amount of data being generated by a Map stage, it may be more optimal to run fewer reduce tasks in the following Reduce stage. Events generated by Outputs are routed to the pluggable Vertex/Edge management modules, allowing them to make the necessary decisions to modify run-time parameters as needed.
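As a sketch of the kind of decision such a management module could make, the snippet below derives a reduce-task count from the aggregate output size reported by the map-side Outputs. Both the method shape and the 1 GiB-per-task target are illustrative assumptions, not Tez's actual vertex-management contract or defaults.

// Illustrative sketch of a run-time parallelism decision: run fewer reduce tasks
// when the maps produced little data. Interface and size target are assumptions.
class AutoReduceParallelism {

    static final long BYTES_PER_REDUCE_TASK = 1L << 30;  // assumed 1 GiB target per task

    // Called once the framework knows how much data the source vertex produced,
    // e.g. aggregated from events emitted by the map-side Outputs.
    int decideReduceTasks(long totalMapOutputBytes, int configuredReduceTasks) {
        int needed = (int) Math.max(1, totalMapOutputBytes / BYTES_PER_REDUCE_TASK);
        return Math.min(configuredReduceTasks, needed);  // never exceed the configured count
    }
}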

 

Available implementations of Inputs/Processors/Outputs

The flexibility of Tez allows anyone to implement their own Inputs and Outputs, whether they use blocking or non-blocking transport protocols, handle data in the form of raw bytes, records, or key-value pairs, etc., and to build Processors that handle this variety of Inputs and Outputs.

There is already a small repository of various implementations of Inputs/Outputs/Processors:

  • MRInput and MROutput: Basic input and outputs to handle data to/from HDFS that are MapReduce compatible as they use MapReduce constructs such as InputFormat, RecordReader, OutputFormat and RecordWriter.
  • OnFileSortedOutput and ShuffledMergedInput: A pair of key-value-based Input and Output that use the local disk for all I/O and provide the same sort-and-merge functionality that is required for the “shuffle” edge between the Map and Reduce stages in a MapReduce job.
  • OnFileUnorderedKVOutput and ShuffledUnorderedKVInput: These are similar to the shuffle pair mentioned earlier except that the data is not sorted implicitly. This can be a big performance boost in various situations.
  • MapProcessor and ReduceProcessor: As the names suggest, these processors are available for anyone trying to run a MapReduce job on the Tez execution framework. They can be used to run an MRR chain too.
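Putting the pieces from this list together, a MapReduce-shaped job can be described conceptually as two vertices joined by a single shuffle edge. The tiny main method below only prints that wiring; it is a plain-Java illustration, not the Tez DAG-construction API.

// Conceptual wiring of an MR-shaped job from the building blocks listed above.
public class MrShapedDagSketch {
    public static void main(String[] args) {
        // Map vertex: HDFS in, sorted/partitioned output feeding the shuffle edge.
        String[] mapVertex = { "MRInput", "MapProcessor", "OnFileSortedOutput" };

        // Reduce vertex: fetch and merge the sorted map output, write results to HDFS.
        String[] reduceVertex = { "ShuffledMergedInput", "ReduceProcessor", "MROutput" };

        // The single logical edge between the two vertices is the MapReduce-style shuffle.
        System.out.println(String.join(" + ", mapVertex)
                + "  --shuffle edge-->  "
                + String.join(" + ", reduceVertex));
    }
}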

 

Original post:

http://hortonworks.com/blog/task-api-apache-tez/

 

 

 

 

 

 
