<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info"
     docName="draft-xu-rtgwg-topo-aware-collective-with-inc-00"
     ipr="trust200902">
  <front>
    <title abbrev="Topology-aware Collective Communication with INC">Topology-aware Collective
    Communication in In-Network Computing Enabled Network: Problem Statement
    and Requirements</title>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Kehan Yao" initials="K." surname="Yao">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>yaokehan@chinamobile.com</email>
      </address>
    </author>

    <date day="10" month="July" year="2023"/>

    <area>Routing</area>

    <workgroup>Routing Area Working Group</workgroup>

    <keyword>In-Network Computing</keyword>

    <keyword>Message Passing Interface</keyword>

    <keyword>High Performance Computing</keyword>

    <abstract>
      <t>This document analyses the mapping between the logical and physical
      topologies of collective communication in In-Network Computing (INC)
      enabled networks, as well as the impact of topology-aware collective
      communication algorithms on INC enabled large-scale computing clusters.
      It also proposes requirements for designing efficient mapping
      mechanisms between logical and physical topologies and topology-aware
      collective communication algorithms.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>Large-scale supercomputing systems have grown significantly in
      recent years. At the heart of these systems are compute nodes based on
      modern multi-core architectures and high-speed networks. These systems
      offer vast amounts of computing power and resources to application
      developers and allow scientific applications to scale out to tens of
      thousands of processes.</t>

      <t>These processes rely on the Message Passing Interface (MPI) to
      exchange information and carry out parallel computation. The hardware
      network is the physical network, while the communication among
      processes, which is independent of hardware devices, is abstracted as a
      logical network. An important aspect of communication in parallel
      computing is a rational mapping between the logical network and the
      physical network. When INC is introduced, network hardware can also
      take part in collective communication, which in turn affects the
      overall communication model. Therefore, in INC enabled large-scale
      clusters, the mapping rules need to be adjusted accordingly.</t>

      <t>In large-scale clusters, network contention can significantly
      degrade application performance when the allocated processors are
      scattered across different racks. It is therefore critical to discover
      the topology of such clusters and to design collective message exchange
      algorithms that are aware of that topology in order to improve the
      overall performance of real-world applications. Once INC is
      introduced, topology discovery should not be limited to factors such as
      network structure and bandwidth, but should also consider factors such
      as INC capacity and computational load.</t>
    </section>

    <section title="Conventions Used in This Document">
      <section title="Terminology">
        <t>INC: In-Network Computing</t>

        <t>MPI: Message Passing Interface</t>
      </section>

      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in BCP
        14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only
        when, they appear in all capitals, as shown here.</t>
      </section>
    </section>

    <section title="Problem Statement">
      <t>In traditional mode, computing tasks are completed by the computers
      and servers in the cluster. After INC is enabled, some of these tasks
      are offloaded to network devices. As a result, for the same MPI
      primitive, the communication subjects in the logical topology can be
      mapped not only to computers but also to network devices. In addition,
      an INC-based implementation of certain MPI primitives may have a
      logical topology that differs from the traditional pattern. Current
      topology mapping mechanisms do not take either of these changes into
      account.</t>

      <t>How to use topology-aware algorithms to improve the communication
      performance of MPI primitives and reduce communication costs in
      large-scale clusters has been an active research topic. <xref
      target="TopoIB"/> presents efficient topology-aware algorithms for two
      collective communication primitives, Scatter and Gather, and proposes
      a communication model for analyzing the communication overhead in
      large-scale clusters. <xref target="Themis"/> proposes a new
      scheduling mechanism and topology-aware algorithm aimed at improving
      network bandwidth utilization, and reports that the network bandwidth
      utilization of a single AllReduce operation can be improved by a
      factor of 1.72. However, when INC is enabled, topology detection
      algorithms should not be limited to network characteristics such as
      bandwidth and communication overhead, but should also consider the
      computing and processing capabilities of the network devices
      themselves.</t>
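
      <t>To make the last point concrete, the following non-normative Python
      sketch compares candidate aggregation devices using a cost that
      combines a simple latency plus size-over-bandwidth transfer model with
      the INC capacity that remains free on each device. The cost formula,
      the function names, and all parameter values are assumptions made for
      illustration; this document does not define a cost model.</t>

      <figure>
        <artwork><![CDATA[
```python
def aggregation_cost(msg_bytes, bandwidth_bps, latency_s,
                     inc_ops_per_s=0.0, inc_load=0.0, elements=0):
    """Estimated seconds to move and reduce one message at a device.

    Transfer time uses a latency + size/bandwidth model. Compute
    time divides the reduction work by the INC capacity that is
    still free (capacity * (1 - load)). A device without usable
    INC capacity gets infinite cost, so it is never preferred.
    """
    transfer = latency_s + 8 * msg_bytes / bandwidth_bps
    free_capacity = inc_ops_per_s * (1.0 - inc_load)
    if free_capacity <= 0:
        return float("inf")
    return transfer + elements / free_capacity


def pick_aggregator(candidates, msg_bytes, elements):
    """Return the name of the candidate with the lowest cost."""
    best = min(
        candidates,
        key=lambda c: aggregation_cost(
            msg_bytes, c["bandwidth_bps"], c["latency_s"],
            c["inc_ops_per_s"], c["inc_load"], elements),
    )
    return best["name"]


# Assumed values: sw_a has the faster link but is heavily loaded,
# sw_b has a slower link but idle INC resources.
candidates = [
    {"name": "sw_a", "bandwidth_bps": 100e9, "latency_s": 1e-6,
     "inc_ops_per_s": 1e9, "inc_load": 0.9},
    {"name": "sw_b", "bandwidth_bps": 25e9, "latency_s": 1e-6,
     "inc_ops_per_s": 1e9, "inc_load": 0.0},
]
print(pick_aggregator(candidates, msg_bytes=4_000_000,
                      elements=1_000_000))
```
]]></artwork>
      </figure>

      <t>With these assumed values, a selection based on bandwidth alone
      would favour sw_a, whereas the combined cost favours sw_b because its
      INC resources are idle.</t>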

      <t>Enabling INC therefore raises several questions:</t>

      <t>* How should the communication subjects of the logical topology be
      mapped to the entities of the INC enabled physical network?</t>

      <t>* How will enabling INC change the logical network topologies of MPI
      primitives, and what challenges will it bring?</t>

      <t>* How can the topology of an INC enabled large-scale cluster be
      discovered efficiently?</t>

      <t>* What are the challenges involved in designing efficient collective
      algorithms that are aware of the INC enabled network topology?</t>
    </section>

    <section title="Requirements">
      <t>The topology mapping algorithms between logical and physical networks
      in INC enabled large-scale clusters, as well as the topology-aware
      collective communication algorithms used to improve cluster
      communication, need to meet the following requirements:</t>

      <t>* Communication entities in INC enabled large-scale clusters MUST
      support mapping both to computing nodes and to network devices in the
      physical network.</t>

      <t>* After INC is introduced, the logical communication topology may
      change. An MPI primitive, for example AllReduce, may correspond to one
      or more logical topologies that support INC. However, an implementation
      based on a logical topology that supports INC MUST produce computation
      results equivalent to those of the traditional method.</t>

      <t>* Topology detection algorithms in INC enabled large-scale clusters
      need to consider not only network factors such as communication
      overhead and path bandwidth, but also the INC capability of network
      devices, such as support for SINC <xref target="I-D.lou-rtgwg-sinc"/>,
      and their computational load.</t>

      <t>* The topology-aware collective communication algorithm SHOULD
      consider the network path load as well as the impact of background
      traffic on cluster communication performance in INC enabled large-scale
      clusters.</t>

      <t>* A reasonable evaluation model for INC enabled large-scale clusters
      is REQUIRED. It needs to take into account factors such as the
      connectivity status and computing capabilities of network devices.</t>

      <t>* The topology mapping algorithm and the topology detection algorithm
      SHOULD support a fallback mechanism that, after an INC failure, remaps
      the logical network to the traditional mode and performs path
      detection again.</t>
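
      <t>The result-equivalence and fallback requirements above can be
      sketched as follows. This is a non-normative Python illustration under
      simplifying assumptions: AllReduce is modelled as a sum over per-rank
      values, INC aggregation is modelled as per-switch partial sums, and the
      function names and the grouping of ranks under switches are
      hypothetical.</t>

      <figure>
        <artwork><![CDATA[
```python
def allreduce_traditional(values):
    """Host-based AllReduce (sum): every rank ends with the total."""
    total = sum(values)
    return [total] * len(values)


def allreduce_inc(values, groups):
    """Two-level AllReduce over a hypothetical INC topology.

    Each entry of groups lists the ranks attached to one INC
    switch. Every switch sums the values of its ranks, the partial
    sums are combined, and the total is returned to every rank.
    """
    partial = [sum(values[r] for r in ranks) for ranks in groups]
    total = sum(partial)
    return [total] * len(values)


def allreduce(values, groups, inc_available):
    """Use the INC topology when available, otherwise fall back
    to the traditional host-based mode."""
    if inc_available:
        return allreduce_inc(values, groups)
    return allreduce_traditional(values)


# Four ranks, two per switch (assumed layout).
values = [1, 2, 3, 4]
groups = [[0, 1], [2, 3]]
with_inc = allreduce(values, groups, inc_available=True)
fallback = allreduce(values, groups, inc_available=False)
print(with_inc == fallback)
```
]]></artwork>
      </figure>

      <t>The example uses integers so that both paths give identical results.
      With floating-point data, a different summation order can change the
      low-order bits, so an implementation may need to state what
      "equivalent" means for such types.</t>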
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Informative References">
      <reference anchor="TopoIB"
                 target="https://doi.org/10.1109/IPDPSW.2010.5470853">
        <front>
          <title>Designing topology-aware collective communication algorithms
          for large scale InfiniBand clusters: Case studies with Scatter and
          Gather</title>

          <author>
            <organization>Kandalla K C, Subramoni H, Vishnu A, et
            al.</organization>
          </author>

          <date month="December" year="2010"/>
        </front>
      </reference>

      <reference anchor="Themis"
                 target="https://doi.org/10.48550/arXiv.2110.04478">
        <front>
          <title>Themis: A Network Bandwidth-Aware Collective Scheduling
          Policy for Distributed Training of DL Models</title>

          <author>
            <organization>Rashidi S, Won W, Srinivasan S, et
            al.</organization>
          </author>

          <date month="October" year="2021"/>
        </front>
      </reference>

      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.8174"?>

      <?rfc include="reference.I-D.lou-rtgwg-sinc"?>
    </references>
  </back>
</rfc>
