TOC 
P2PSIP Working GroupS. Baset
Internet-DraftH. Schulzrinne
Intended status: Standards TrackColumbia University
Expires: August 5, 2007February 2007


Peer-to-Peer Protocol (P2PP)
draft-baset-p2psip-p2pcommon-01

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 5, 2007.

Copyright Notice

Copyright © The Internet Society (2007).

Abstract

This document defines Peer-to-Peer Protocol (P2PP), an application-layer protocol, for creating and maintaining an overlay of participant nodes. The overlay can be created using various structured and unstructured peer-to-peer protocols such as Chord, Pastry, Gnutella, and Gia.



Table of Contents

1.  Introduction
2.  Overview of P2PP Functionality
    2.1.  Salient Features of the Peer-to-Peer Protocol
3.  Terminology
4.  Design Overview
    4.1.  Overall Design Approach
    4.2.  To SIP or not to SIP
    4.3.  To ASCII or not to ASCII
    4.4.  P2PP Interface, Messages, and API
    4.5.  P2PP and SIP
    4.6.  Request Routing
    4.7.  Transport
        4.7.1.  Issues in Recursive Routing
        4.7.2.  Issues in Iterative Routing
        4.7.3.  P2PP Message Size
        4.7.4.  Listening Ports
    4.8.  NAT and Firewall Traversal
    4.9.  Node Capabilities
5.  P2PP Processing Overview
    5.1.  Node State
    5.2.  Request Generation and Response Processing
        5.2.1.  Issues in Recursive Routing
        5.2.2.  Issues in Iterative Routing
    5.3.  Request Processing and Response Generation
    5.4.  Message State Machine
        5.4.1.  Request State Machine
        5.4.2.  Response State Machine
    5.5.  Timers
6.  Message Formats
    6.1.  Join
    6.2.  Leave
    6.3.  Keep-alive
    6.4.  Peer-Lookup
    6.5.  Update
    6.6.  Query
    6.7.  Replicate
    6.8.  Insert (put)
    6.9.  Lookup (get)
    6.10.  Remove
7.  Bit-Level Formats and Errors
    7.1.  Common Header
    7.2.  General Object Format
    7.3.  P2PP TLV Objects
        7.3.1.  Peer-Info
        7.3.2.  Request-Options
        7.3.3.  P2P-Options
        7.3.4.  Routing-Table
        7.3.5.  Neighbor-table
        7.3.6.  PLookup
        7.3.7.  Resource-Object
        7.3.8.  Expires
        7.3.9.  Owner
        7.3.10.  Credentials
    7.4.  Errors
        7.4.1.  Error Object
        7.4.2.  Error Codes
8.  API between P2PP and Applications
    8.1.  Query
    8.2.  Join
    8.3.  Leave
    8.4.  Insert
    8.5.  Remove
    8.6.  Lookup
9.  Security Considerations
10.  Open Issues
11.  Acknowledgements
12.  IANA Considerations
13.  References
    13.1.  Normative References
    13.2.  Informative References
Appendix A.  Background
    A.1.  Structured Overlays
        A.1.1.  Chord
        A.1.2.  Pastry
        A.1.3.  Kademlia
        A.1.4.  Bamboo/OpenDHT
    A.2.  DHT Commonalities
    A.3.  Unstructured Overlays
        A.3.1.  Gia
Appendix B.  Examples
    B.1.  Join (Pastry)
    B.2.  Join (Gia)
    B.3.  Join (Chord)
    B.4.  Lookup (Pastry)
    B.5.  Lookup (Gia)
    B.6.  Lookup (Chord)
§  Authors' Addresses
§  Intellectual Property and Copyright Statements




 TOC 

1.  Introduction

This document defines Peer-to-Peer Protocol (P2PP) for creating and maintaining an overlay of participant nodes. P2PP also allows non-participant nodes to use overlay services through participant nodes. The design of P2PP exploits commonalities in the peer-to-peer (p2p) protocols such as Chord [12] (Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and H. Balakrishnan, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications,” .), CAN [13] (Ratmasamy, S., Francis, P., Handley, M., Karp, R., and S. Shenker, “A Scalable Content-Addressable Network,” August 2001.), Pastry [14] (Rowstron, A. and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” March 2002.), Kademlia [15] (Maymounkov, P. and D. Mazieres, “Kademlia: A Peer-to-Peer Information System Based on the XOR Metric,” March 2002.), and Gia [19] (Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., Kaashoek, M., and S. Shenker, “Making gnutella-like p2p systems scalable,” 2003.) thereby defining a protocol that does not contain any peer-to-peer protocol specific details and has an extension mechanism to incorporate a protocol-specific feature. Though general in scope, the specification does not claim that P2PP can be used to implement any peer-to-peer protocol.



 TOC 

2.  Overview of P2PP Functionality

P2PP is an application-layer protocol that can be used to form and maintain an overlay among participant nodes. It provides mechanisms for nodes to join, leave, publish, or search for a resource-object in the overlay. A common aspect of these operations is to find an appropriate node in the overlay. A node can find an appropriate node by maintaining a table of other nodes, called routing table. Since, an overlay can comprise a large number of nodes, a node's routing table only contains a subset of these nodes.

For every request received, a node checks if it can accomplish the request. If it cannot complete the request, it sends the message to node(s) which is most likely to accomplish the request. A node determines the destination of the request from its routing-table by performing a local nextHop() operation. This method of message forwarding is known as recursive-routing. The routing methods are explained in Section 4.6 (Request Routing).

It is important to note that a node determines the next hop for a message solely from its routing table; it does not consult any other nodes. Different peer-to-peer protocols have different mechanisms for determining the next hop for a request; however this is done on a per-node basis. This aspect keeps the design of P2PP independent of any existing peer-to-peer protocols.



 TOC 

2.1.  Salient Features of the Peer-to-Peer Protocol

The protocol can be used to implement a wide range of structured and unstructured p2p algorithms.

Message size can exceed MTU, if UDP is used. The specification does not specify a reliable communication layer on top of UDP and instead recommend the use of TCP or TLS over TCP.

Nodes in a single overlay only run one unique P2P algorithm such as Chord or Pastry.

Nodes can participate in different overlays.



 TOC 

3.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).

Some of the terminology has been borrowed from the P2P terminology draft [8] (Willis, D., Bryan, D., Matthews, P., and E. Shim, “Concepts and Terminology for Peer-to-Peer SIP,” October 2006.).

P2PSIP Overlay Peer (or Peer):
As defined by Willis et al. [8] (Willis, D., Bryan, D., Matthews, P., and E. Shim, “Concepts and Terminology for Peer-to-Peer SIP,” October 2006.), a P2PSIP peer is a node participating in a P2PSIP overlay that provides storage and routing services to other nodes in P2P overlay and is capable of performing several different operations such as joining and leaving the overlay and routing requests within the overlay. We use the term node and peer interchangeably.
P2PSIP Client (or Client):
As defined by Willis et al. [8] (Willis, D., Bryan, D., Matthews, P., and E. Shim, “Concepts and Terminology for Peer-to-Peer SIP,” October 2006.), a P2PSIP client is a node participating in a P2PSIP overlay that provides neither routing nor storage and retrieval functions to that P2PSIP Overlay.
P2PSIP Peer-ID (or Peer Key):
As defined by Willis et al. [8] (Willis, D., Bryan, D., Matthews, P., and E. Shim, “Concepts and Terminology for Peer-to-Peer SIP,” October 2006.), a Peer-ID is an information that uniquely identifies a peer within a given P2PSIP overlay. In DHT-based overlays, this is a hash of the unique node identifier such as an IP address. In unstructured overlays, this is an unhashed unique identifier.
Identifier (ID or Key):
An identifier is an information that uniquely identifies a resource-object, a node, or a service. Peer-ID is a form of identifier. We use the term identifier (ID) and key interchangeably.
Routing Table:
A routing table (finger table in Chord) is used by a node to locate a node that either has a resource-object or offers a certain service such as routing or NAT traversal. In a DHT-based overlay, a node's routing table contains a list of overlay peer-IDs and their IP addresses stored against identifiers that are exponentially away from the node's identifier and thus has a particular structure. Nodes in unstructured networks also maintain a routing table. This specification does not put an upper limit on the number of entries a node can store in its routing table.
Note that the distinction between neighbor and routing table is an artifact of structured peer-to-peer networks.
Neighbor Table:
In a DHT-based overlay, a node's neighbor table (successor list in Chord, leaf-set in Pastry) contains a list of peers which are 'immediately' close to the node's identifier. The purpose of the neighbor table is to maintain correctness. In an unstructured overlay, nodes only maintain a routing table.
Resource-object:
A resource-object is an information a node publishes in an overlay so that other nodes can locate it. Each resource has an ID that may or may not be unique.
Service:
A node can offer different services such as NAT and firewall traversal, and relaying traffic for other nodes. A node advertises about its service capabilities in the overlay. ([Q] Is there a need to differentiate between Resource-object and service?) Different nodes can advertise the same service capabilities.



 TOC 

4.  Design Overview



 TOC 

4.1.  Overall Design Approach

There are three high-level requirements for a peer-to-peer protocol.

Resource publishing and lookup:
The protocol should provide a mechanism for a peer to publish a resource-object or advertise its service and a mechanism to lookup the resource-object and the node offering a service.
P2P network maintenance:
The protocol should provide mechanisms to maintain connectivity and resource availability in a peer-to-peer network.
Heterogenous connectivity:
Nodes should be able to form an overlay in heterogeneous network environments and exchange information about their uptime and capacity.

P2PP allows nodes to form a structured or an unstructured overlay. Nodes that participate in the routing and overlay maintenance decisions are called peers (super-nodes in conventional P2P terminology). Nodes that do not participate in an overlay are called clients (ordinary-nodes in conventional P2P terminology). Clients can use overlay services through peers. Multiple clients can participate in the overlay through one peer. On the other hand, a client can communicate with multiple peers at the same time. Client and peers communicate using the same protocol as peers do, i.e., there is no distinct protocol for client-to-peer and peer-to-peer communication.

P2PP is a request-response protocol. Peers or clients can issue requests which other peers can answer. The transmission of requests and responses is governed by a request and response state machine as defined in Section 5.4 (Message State Machine). The requests can be forwarded in a recursive or an iterative manner. The requests can also be forwarded in a sequential or parallel manner.

The process to discover a peer already in the overlay is outside the scope of this specification. Several techniques such as multicast, cache of nodes during last connection, or an appropriate rendezvous mechanism can be used. Once a node already in the overlay has been discovered, a peer may be admitted into an overlay. The specification does not specify the conditions under which a peer may or may not be admitted into the overlay and leaves policy decisions such as admission control to the implementers of an overlay. This gives P2PP its desired flexibility.

The protocol allows nodes to exchange uptime, CPU and bandwidth utilization, network bandwidth, and network connectivity information with other nodes. However, it does not mandate the exchange of this information. (Open Issue: The meaning of these terms needs to be precisely defined.)

The protocol can be used in different network environments. In an Internet deployment, peers can have heterogeneous connectivity. An overlay designer may configure a policy such that only peers with a public IP address can participate in the overlay. In a corporation deployment, most or all of the nodes may be behind NATs or firewalls which can participate in the overlay. As mentioned earlier, the protocol does not enforce any restriction on which nodes may participate in the overlay and leaves this decision to the overlay implementer.



 TOC 

4.2.  To SIP or not to SIP

A SIP [2] (Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Petersen, J., Sparks, R., Handley, M., and E. Schooler, “SIP:Session Initiation Protocol,” June 2002.) application can use the P2PP protocol as an alternate method of locating SIP servers as defined in RFC 3263. There are several reasons why we did not choose SIP as the underlying peer-to-peer protocol.

SIP is a session establishment protocol.
SIP is a protocol for initiating, modifying, and terminating interactive sessions [2] (Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Petersen, J., Sparks, R., Handley, M., and E. Schooler, “SIP:Session Initiation Protocol,” June 2002.). Communicating with other nodes to form and maintain an overlay does not create an interactive session.
SIP is not a general Remote Procedure Call (RPC).
A peer-to-peer protocol designed on top of SIP will merely use it as a remote procedure call (RPC) mechanism. The SIP guidelines document [3] (Rosenberg, J. and H. Schulzrinne, “Guidelines for Authors of Extensions to the Session Initiation Protocol (SIP),” May 2006.) prohibits the use of SIP as a RPC mechanism.
SIP is not a general purpose transfer protocol.
One possible way to design the peer-to-peer protocol is to incorporate it as a SIP message body, possibly in XML, leaving the SIP headers unchanged. The peer-to-peer protocol is, however, unrelated to SIP's operation and the SIP guideline document [3] (Rosenberg, J. and H. Schulzrinne, “Guidelines for Authors of Extensions to the Session Initiation Protocol (SIP),” May 2006.) prohibits sending large amounts of data unrelated to SIP's operation. Such a mechanism would use SIP as a RPC, which, as stated earlier, is prohibited by the SIP guidelines document.
SIP is not a lookup protocol.
SIP is not a lookup protocol; it relies on DNS to locate an appropriate SIP server. Insert and lookup functionality is essential for any peer-to-peer protocol. Using SIP as a peer-to-peer protocol requires integrating lookup functionality into SIP which goes against its design philosophy.
SIP headers overhead.
SIP has headers that are not necessarily needed for a peer-to-peer protocol. Examples of such headers are Call-ID, and Contact.



 TOC 

4.3.  To ASCII or not to ASCII

Perhaps, the first question facing the designer of a new protocol is whether it should be ASCII or binary based. DHTs are a primary motivation for this specification and a message in DHT-based overlay is likely to carry the hash identifier of a peer or resource-object. Using an ASCII-based protocol will require converting binary hash identifier to ASCII and vice versa. This conversion can be avoided if the underlying peer-to-peer protocol is binary based.

Also, if the protocol were to be ASCII-based, existing ASCII-based protocols such as XML-RPC could have been used.



 TOC 

4.4.  P2PP Interface, Messages, and API

An interface is an abstraction which groups together messages that perform a related function.

API provides a mechanism for an application to use overlay services. An API call can be accomplished using one or more P2PP messages.

A message is a P2PP protocol message that performs a certain function.

This specification defines three interfaces for the P2PP protocol namely:

          ^
          |
        P2PP
     Application          +------------+     +-----------+
        Level             |SIP Proxy/UA|     |Application|
          |               +------------+     +-----------+
          V                      |                 |
                 ================|=================|========= P2PP API
                                 |                 |
          ^           +----------------------------------------+
          |           | P2PP: service, data,    +-------------+|
          |           |       diagnostic        | Node State  ||
          |           |       interface         | Maintenance ||
          |           |                         +-------------+|
          |           +----------------------------------------+
          |                      |        |        |
          |           ..........................................
        P2PP          . Transport Layer Security (TLS or DTLS) .
      Transport       ..........................................
        Level                    |        |
          |                   +-----+  +-----+
          |                   | UDP |  | TCP |  ... other
          |                   +-----+  +-----+      protocols
          |                      |        |
          |                .............................
          |                .     IP Layer Security     .
          |                .............................
          V                      |        |        |
=================================|========|========|=============
                                 |        |        |
                        +------------------------------------+
                        |                IP                  |
                        +------------------------------------+

Figure 1: Protocol stack architecture for P2PP

The service interface defines operations for node join, leave, routing-table maintenance and replication. The data interface defines operations for data (resource) insertion, lookup, and removal. The diagnostic interface defines operations for gathering peer statistics such as response-time and relay bandwidth.

The implementer of a peer-to-peer service may be interested in gathering various statistics such as response-time and peer-bandwidth. The diagnostic interface defines operations for gathering these statistics. (Open Issue: Note that some of these statistics can also be gathered using functions defined in the maintenance interface. Thus, it is not entirely clear whether diagnostics should be considered as a separate interface or whether its operations be included in the maintenance interface.)

The state of an overlay is defined is the nodes participating in the overlay. The state changes if a node joins or leaves the overlay. A useful characterization of P2PP messages is whether they change the state of an overlay. Note that only join and leave change the state of the overlay while insert, lookup, remove, routing-table maintenance, and replication do not.

P2PP messages begin with a header followed by a sequence of type-length-value (TLV) objects. Each message is either a request or a response. The response header is the same as a request header except that it contains a response code. P2PP messages cannot be combined in one message if an unreliable transport is used.



 TOC 

4.5.  P2PP and SIP

SIP [2] (Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Petersen, J., Sparks, R., Handley, M., and E. Schooler, “SIP:Session Initiation Protocol,” June 2002.) is an application-layer protocol that can establish, modify, and terminate multimedia sessions. SIP nodes can either send requests to other SIP nodes directly or through a SIP proxy. A SIP node (user-agent or proxy) typically uses a centralized location service to determine the destination of the request. SIP nodes can avoid the use of a centralized location service by sharing the responsibility of a location service in a distributed way. The use of P2PP allows a SIP node to accomplish this by creating an overlay among participant nodes. P2PP also allows SIP nodes that do not participate in the overlay to use overlay services through participant SIP nodes. Thus, P2PP can be used as P2PSIP peer and client protocols [8] (Willis, D., Bryan, D., Matthews, P., and E. Shim, “Concepts and Terminology for Peer-to-Peer SIP,” October 2006.).



 TOC 

4.6.  Request Routing

A peer can issue the request in a recursive or an iterative manner. In recursive routing, the request traverses the peers until it reaches the peer responsible for the resource-object if it exists. The peer issuing the request has no control over the peers which the request may traverse. In iterative routing, the peer sends a request to another peer which replies with the network address of the peer to which the request should be forwarded. The requesting peer then forwards the request directly to this peer. In this way, the requesting peer has control over the peers through which the request is forwarded.

A request can also be forwarded in an iterative parallel way. The requesting peer can send the request to multiple peers at the same time. The requesting peer MUST designate one request copy as primary.

The choice of using recursive or iterative routing is negotiated at join. The use of recursive or iterative routing is overlay specific; i.e, all nodes in an overlay use either recursive or iterative routing.



 TOC 

4.7.  Transport

A node may decide to send the request over UDP, TCP or TLS over TCP in a recursive or an iterative manner. This section explores various tradeoffs in the use of UDP and TCP for request forwarding.



 TOC 

4.7.1.  Issues in Recursive Routing

Forwarding the requests in a recursive manner over TCP may result in establishing a per hop TCP connection. However, the nodes to whom a peer forwards the request are its routing and immediate neighbors. A peer may choose not to terminate the TCP connection after forwarding the request. It can then reuse those connections for incoming and new requests. Forwarding requests in a recursive manner over UDP requires that each node receiving the request periodically informs the node forwarding the request about its progress. This is necessary because it is the only mechanism for a node forwarding the request to know that the peer it forwarded the request to is still alive. Thus, the use of UDP necessitates the design of a 'semi-reliable' state machine.



 TOC 

4.7.2.  Issues in Iterative Routing

If iterative routing is used, a peer will receive a next hop reply from the peer it forwarded the request. If TCP is used, then the peer issuing the request will need to establish a TCP connection with every next hop peer. (Open Issue: shall the peer immediately close the connection after getting the response in iterative routing?)



 TOC 

4.7.3.  P2PP Message Size

The use of TCP may be preferable because the size of the request or response may exceed UDP MTU size and thus the use of UDP will necessitate the design of a reliable communication layer on top of it. This specification does not provide such a layer and instead recommends the use of TCP or TLS over TCP. However, as described for iterative routing, TCP is not the best choice for request forwarding because of its three-way connection setup latency. One way to address the tradeoff between TCP setup latency and large packet sizes is to reexamine the nature of requests. Examples of requests which may exceed UDP MTU are insert and lookup. For insert, the request can be forwarded over UDP only with the resource-ID of the object and without the object itself until it reaches the peer responsible for the resource-object. The publisher of the resource-object, can then reissue the request, with the complete resource-object, over TCP to the peer responsible for the object.

In a peer-to-peer network, a node is only concerned with its own routing neighbors. It does not necessarily need to know which other peers have this node in their routing tables. Using TCP for overlay maintenance will require that a node not only manages TCP connections for its own routing tables, but also manage incoming TCP connections for other nodes' routing tables. In a fully populated Chord network, a peer will need to maintain logN connections for its routing table. It will also receive logN connections from other peers for their routing tables. Thus, in a fully populated Chord network where nodes use 160-bit hash ID and a log base of two, a peer can maintain 160 TCP connections for its routing table assuming that it does not maintain any backup connection for its ith routing table row and will, on the average, receive 160 connections from other peers. Thus, it will need to maintain 2*logN or 320 TCP connections. (Open Issue: is it a good practice to maintain so many TCP connections? What about many simultaneous connections from one IP address?)

Security may be a concern for some peer-to-peer deployments, and peers may choose TLS over TCP as the underlying transport protocol. Since, the peers can have heterogeneous capabilities, it is an open issue whether using a light-weight TLS-like protocol might be better than TLS over TCP.

A peer receiving the request MUST use the same transport and port to send the response. If the request was sent over UDP, and the response size may exceed MTU, the peer should reply with a 413 Message Too Large response. It is up to the peer issuing the request to send the request over a reliable transport such as TCP.



 TOC 

4.7.4.  Listening Ports

The specification proposes a standard UDP and TCP listening port to be assigned by IANA. The current implementation uses 7080 as UDP and TCP listening port.



 TOC 

4.8.  NAT and Firewall Traversal

It is up to the implementer of the overlay to decide whether there is a need to perform NAT or firewall traversal. If so, then the peers SHOULD implement STUN [5] (Rosenberg, J., Huitema, C., Mahy, R., and D. Wing, “Simple Traversal Underneath Network Address Translators (NAT) (STUN),” October 2006.), TURN [6] (Rosenberg, J., Mahy, R., and C. Huitema, “Obtaining Relay Addresses from Simple Traversal Underneath NAT (STUN),” October 2006.), and ICE [7] (Rosenberg, J., “Interactive Connectivity Establishment (ICE): A Methodology for Network Address Translator (NAT) Traversal for Offer/Answer Protocols,” October 2006.) protocol and advertise this as a service in the overlay. Many peers can advertise NAT traversal service under a particular service name which is stored as a resource-object. One possible option for peers is to choose the same name for NAT and firewall traversal service in which case they will publish their service advertisement at the same peer in the overlay. Node queries looking for a NAT traversal peer will always end up at the same peer. Another option is to use ReDiR mechanism [10] (Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., and H. Yu, “OpenDHT: A Public DHT Service and its Uses,” 2005.) which allows distributing service registration over multiple nodes and still do logN lookups. (Open Issue: what are the possible approaches for advertising NAT and firewall traversal service? ReDiR mechanism needs further elaboration.)



 TOC 

4.9.  Node Capabilities

A peer MAY advertise its capabilities such as its link bandwidth, CPU and bandwidth utilization. Advertising link bandwidth and its utilization is useful because it determines whether a peer can be used as a relay either for TURN [6] (Rosenberg, J., Mahy, R., and C. Huitema, “Obtaining Relay Addresses from Simple Traversal Underneath NAT (STUN),” October 2006.) or as a generic relay. It is currently an open issue what is the best way for a node to reasonably estimate these quantities. One option is that nodes can advertise bandwidth estimation service using STUN. The nodes looking to estimate their link bandwidth can then locate the peers advertising bandwidth estimation service.



 TOC 

5.  P2PP Processing Overview



 TOC 

5.1.  Node State

A P2PP peer maintains the following state:

Peer-to-Peer Algorithm:
Peer-to-peer algorithm of the overlay.
Overlay-ID:
ID of the overlay this peer is part of.
Hash-Algorithm:
The hash-algorithm, if any, being used.
Routing:
Whether recursive or iterative routing is being used.
Uptime:
The uptime of this peer.
Routing Table:
A table of other peers in the overlay. For each peer, following information is maintained.
  • Peer-ID: The ID of the peer. For a structured p2p algorithm, it is typically the hash of the IP address of the peer. Otherwise, it can be an IP address or some unique ID.
  • IP address: The IP address of the peer.
  • RTT: The round-trip-time of this peer. If TCP is used, an application may obtain this value from the TCP stack if the underlying OS supports it.
  • Uptime: The uptime of the peer.
Neighbor Table:
A table of peers with ID immediately close to the peer's ID in a structured overlay. For each peer, following information is maintained.
  • Peer-ID: The ID of the peer. For a structured p2p algorithm, it is typically the hash of the IP address of the peer. Otherwise, it can be an IP address.
  • IP address: The IP address of the peer.
  • RTT: The round-trip-time of this peer.
  • Uptime: The uptime of the peer.
Num Neighbors:
Total number of peers in routing and neighbor tables.
Resource Table:
Resource-objects this peer is responsible for.
Replicated Resource Table:
Resource-objects this peer stores as backup for other objects.



 TOC 

5.2.  Request Generation and Response Processing

The request can be issued in a recursive or an iterative manner. A node forwarding a request creates a request transaction and passes it the request to deliver it to the intended recipient. Each transaction is identified by a random and unique 32-bit unsigned integer called transaction-ID. The request is generated and responses are processed according to the following rules:

  1. Requests cannot be combined.(Open Issue: It does not matter over TCP)
  2. The transport of the request SHOULD be preserved, i.e., if the request was received over TCP, it SHOULD be forwarded over TCP.
  3. A peer issuing the request MAY request to receive a copy of the routing and neighbor tables of the peer to whom the request was sent. It does so by setting the request-routing-table and request-neighbor-table flags in the Request-Options object. Considerations apply for recursive and iterative routing.
  4. If the peer issuing the request receives a 480 Alternative Service response, it SHOULD retry the request over TCP directly to the peer which generated this response.
  5. The request SHOULD be sent over TCP if its size exceeds UDP MTU regardless of whether recursive or iterative routing is being used.
  6. The response is matched to the appropriate request transaction by the transaction-ID.

Following considerations apply if the request is being issued in a recursive or an iterative manner.



 TOC 

5.2.1.  Issues in Recursive Routing

The request can be sent over TCP or UDP by the originator. (Open Issue: should only TCP or UDP be used for request forwarding?)

If the request was sent over UDP, the node issuing the request SHOULD always set the in-separate-request flag in the Request-Options object. The reason is that the routing and neighbor tables of the nodes encountered along the path of the request are not likely to fit within the response UDP packet.

If request was sent over TCP and request-routing-table and request-neighbor-table flags were set, each node along the request path adds a copy of its routing or neighbor table in the response (Open Issue: The response size can explode). If the node issuing the request wishes to receive a copy of routing or neighbor table not in the response but in a separate request, it sets the in-separate-request flag in the Request-Options object.

If recursive routing is being used, and a peer receives a response, it consults the transaction-association (TA) table. If it finds a matching entry, it forwards the response to the peer from which it received the request. If request was sent over TCP and request-routing-table and request-neighbor-table flags were set and in-separate-request flags were not set, the peer forwarding the response adds a copy of its own routing or neighbor table and updates the message length in the common header.



 TOC 

5.2.2.  Issues in Iterative Routing

The request SHOULD be sent over UDP if the request size fits within UDP MTU.

If the request was sent in an iterative manner over TCP, the node SHOULD not set the in-separate-request flag.

If the request was sent in an iterative manner over UDP, the node issuing the request MAY set the partial-reply flag in the Request-Options object. If set, the node receiving the request includes in its reply as much routing or neighbor table peers as permitted by the UDP MTU. A node issuing the request MAY also set partial-reply and in-separate-request flags in the Request-Options object. A node receiving the request with these flags set includes in its reply its routing or neighbor entries as permitted by the UDP MTU. It sends the remaining portion of routing or neighbor table in a separate Update request.



 TOC 

5.3.  Request Processing and Response Generation

A request can be received over UDP or TCP. A request received over UDP is processed according to the following steps:

  1. The peer checks the TTL field in the common header. If TTL is equal to zero and the peer cannot give a definitive answer, it MUST reply with a 410 TTL Hops Exceeded response. The response is always sent to the peer from whom the request was received and not to the peer who issued the request.
  2. If TTL is greater than zero, recursive routing was being used, and the peer cannot give a definitive answer for the received request, it determines the next hop from its routing or neighbor tables, creates a new request transaction, and passes it the request after decrementing TTL by one in the common header. The rest of the forwarded request remains unchanged. The peer creates a transaction association (TA) between the response and request transactions. Also, the peer MUST immediately send a 100 Trying response using the same transport and port on which the request was received. (Open Issue: Can we eliminate 100 Trying altogether?)
  3. If TTL is greater than zero, iterative routing was being used, and the peer cannot satisfy the request, it determines the next hop from its routing or neighbor tables. It then replies with a 302 Next Hop response and includes its Peer-Info and the next hop Peer-Info object in the response.
  4. If a peer receiving the request determines that it can give a definitive answer which will fit within UDP MTU, it checks if the Request-Options object was present and that request-routing-table or request-neighbor-table and in-separate-request flags were set. If set, the peer waits for a random time between T2 and T3 seconds, and sends a copy of its routing or neighbor table or both in an Update request to the peer which issued the request. If in-separate-request was not set, the peer generating a definite response includes in the response as much of the requested routing or neighbor table entries as permitted by the UDP MTU.
  5. If the peer receiving the request determines that it can give a definitive answer but the response will not fit within UDP MTU, it can reply with a 413 Message Too Large or a 480 Alternative Service response to indicate to the node issuing the request to retry the request over TCP.
  6. The peer checks if it can update its routing and neighbor tables from the received request.

A request received over TCP is processed according to the following steps:

  1. The peer checks the TTL field in the common header. If TTL is equal to zero and the peer cannot give a definitive answer, it MUST reply with a 410 TTL Hops Exceeded response. The response is always sent to the peer from whom the request was received and not to the peer who issued the request.
  2. If TTL is greater than zero, recursive routing was being used, and the peer cannot give a definitive answer for the received request, it determines the next hop from its routing or neighbor tables, creates a new request transaction, and passes it the request after decrementing TTL by one in the common header. The rest of the forwarded request remains unchanged. The peer creates a transaction association (TA) between the response and request transactions.
  3. If TTL is greater than zero, iterative routing was being used, and the peer cannot satisfy the request, it determines the next hop from its routing or neighbor tables. It then replies with a 302 Next Hop response and includes its Peer-Info and the next hop Peer-Info objects in the response.
  4. If a peer receiving the request determines that it can give a definite answer, it checks if the Request-Options object was present and that request-routing-table or request-neighbor-table and in-separate-request flags were set. If set, the peer waits for a random time between T2 and T3 seconds, and sends a copy of its routing or neighbor table or both in an Update request to the peer which issued the request. If in-separate-request was not set, the peer includes a copy of its routing and neighbor table in the response.



 TOC 

5.4.  Message State Machine

(Open Issue: The state machines for reliable and unreliable transports need to be combined.)



 TOC 

5.4.1.  Request State Machine

The request transaction state machine for unreliable transports is shown in Figure 1.

The "Trans_Req" state is entered when a peer issues or forwards a request. When entering this state, the transaction SHOULD set timer T2 to fire in T1 seconds. It SHOULD initialize variable n to zero. If Timer T2 fires and n is less than the maximum allowed retransmissions N, the request is retransmitted, n is incremented and timer T2 is set to fire in (2^n * T1) seconds. If timer T2 fires and n is equal to N, the state machine transitions to "Failed" state and is terminated.

If a 2xx-4xx response is received, the transaction enters "Success" state and is terminated.

                      +-----------+
                      |           |
                      |  Initial  |
                      |           |
                      +-----------+
                            |
                            | tx_Request / set Timer T2
Timer T2 fires  /           |
If n != N                   V              Transport Err. or
   reset T2           +------------+       Timer T2 fires and
   tx_Request  +------|            |       n == N
               |      | Trans_Req  |---------------+
               +----->|            |               |
                      +------------+               |
                            |                      |
                            | rx_Resp (2xx-4xx) /  |
                            | Pass to app.         |
                            V                      |
                      +-----------+                |
                      |           |                |
                      |  Success  |                |
                      |           |                |
                      +-----------+                |
                                                   |
                                                   |
                      +-----------+                |
                      |           |                |
                      |  Failure  |<---------------+
                      |           |
                      +-----------+

Figure 2: request transaction for unreliable transports

The request transaction state machine for reliable transports is shown in Figure 2.

The "Trans_Req" state is entered, when a peer issues or forwards the request. When entering this state, it SHOULD set timer T5. The value of T5 is application dependent. If timer T5 fires, the state machine transitions to "Failed" state and is terminated. If a 2xx-4xx response is received, the state machine transitions to "Success" state and is terminated. If the request was forwarded in a recursive manner, the application MUST not terminate the reliable-transport connection. If the request was forwarded in an iterative manner, an application MAY terminate the reliable-transport connection if it does not anticipate its reuse.

                      +-----------+
                      |           |
                      |  Initial  |
                      |           |
                      +-----------+
                            |
                            | tx_Request / set Timer T5
                            |
                            V
                      +------------+       Transport Err. or
                      |            |       Timer T5 fires
                      | Trans_Req  |---------------+
                      |            |               |
                      +------------+               |
                            |                      |
                            | rx_Resp (2xx-4xx) /  |
                            | Pass to app.         |
                            V                      |
                      +-----------+                |
                      |           |                |
                      |  Success  |                |
                      |           |                |
                      +-----------+                |
                                                   |
                                                   |
                      +-----------+                |
                      |           |                |
                      |  Failure  |<---------------+
                      |           |
                      +-----------+

Figure 3: request transaction for reliable transports



 TOC 

5.4.2.  Response State Machine

There is no state machine for responses. Each request generates a response.

If recursive routing is being used and the request is received over an unreliable transport, a 100 Trying response SHOULD be sent.



 TOC 

5.5.  Timers

This section defines timers for request and response state machines.

Timer    Value
----------------------
T0       5s
T1       500ms
T2       T0
T3       2*T0

T5       T0 (Application dependent)
T6       T0 (Application dependent)



 TOC 

6.  Message Formats

All P2PP messages begin with a common header, followed by a sequence of type-length-value (TLV) objects. This section describes various P2PP messages and their contents at a high level in ABNF [4] (Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications:ABNF,” October 2005.).

We define seven messages for service interface namely, join, leave, keep-alive, routing-peer-lookup, update, query, replicate and three messages for data interface namely, insert, lookup and remove.

P2PP-Msg = Join / Leave / Keep-alive /
           Routing-Peer-Lookup / Update / Query / Replicate /
           Insert / Lookup / Remove

A node (client) which does not participate in an overlay does not need to implement all of these messages. Client-to-peer messages are independent of the p2p algorithm being used. The peer-to-peer messages are also independent of the p2p algorithm being used because a peer locally determines the next hop after consulting its routing or neighbor tables. Note that peers in an overlay only run one type of p2p algorithm such as Chord, Pastry, Kademlia or Gnutella and there is no mixing of the p2p algorithm, i.e., it is not possible for some nodes in the same overlay to only use Chord and some nodes to only use Pastry. A peer can simultaneously participate in two different overlays.



 TOC 

6.1.  Join

A node sends a join request to a peer already in the overlay to insert itself in the network. The mechanism to discover a peer already in the p2p network is independent of any particular p2p algorithm being used. The possible neighbors of this new node MUST be informed about the join request.

A node MUST send a Query request to a discovered node before a join request. The Query request is used to determine overlay parameters such as overlay-ID, peer-to-peer and hash algorithms, and request routing method (recursive vs. iterative) being used.

The node then constructs a join request and passes it to the request transaction. The node that wishes to participate in the overlay MUST set the 'insert' flag in the Request-Options object. The node that does not wish to participate in the overlay MUST NOT set the 'insert' flag in the overlay. The peer receiving a join request from a client does not forward it to other peers in the overlay and immediately returns a 200 Ok response.

A peer MAY request to receive a copy of routing and neighbor tables from the nodes which receive the join request. It does so by following the rules defined in Section 5.2 (Request Generation and Response Processing)

Join = Common-header
     Request-Options
     Peer-Info

If a node receiving the join request is not the immediate neighbor of this joining peer, and recursive routing is being used, it forwards the request to a suitable peer by following the rules defined in Section 5.3 (Request Processing and Response Generation).

A peer receiving an iterative join request sends a 302 Next Hop response about the next hop if the joining peer will not be its immediate neighbor.

If a peer receiving the request determines that the joining node will be its neighbor, it MUST send a 200 Ok response to the node from which it received the request. The response contains the time before which the joining peer SHOULD send a keep-alive message. The response SHOULD also contain a copy of the neighbor table of the peer.

Once the node issuing the join request receives a 200 OK response, it checks the received neighbor table. It then sends a join request with the S flag set in the Request-Options object to each of the neighbors. It is really up to the peer generating the 200 OK response to decide, based on the p2p algorithm being used, which neighbors (successors, predecessors) a joining peer may contact to insert itself appropriately in the overlay. The neighbors of this newly joining peer SHOULD update their neighbor and routing tables appropriately. After the receipt of the join request, the neighbors wait for a random time before exchanging their neighbor tables.

Peers in the overlay MAY decide to reject a join request with a 407 Request Rejected response because it does not satisfy the overlay policy. NAT and node capacity examples of such policies. This specification does not specify any policy for the node admission and defers this decision to the overlay implementer.

Peers in the overlay MAY not be willing to admit a new node with no history about its uptime. If so, they reply with a 407 Join Request Deferred response and an Expires object after which the peer can reissue the join request. The peer receiving this response SHOULD resend the join request with 'insert' flag reset in the Request-Options object to insert itself as a client in the overlay. It SHOULD retry the join request after the passage of time in the Expires object.

Join (Resp) = Common Header
            Response-code
            Peer-Info
            [Peer-Info]
            [Expires]
            [Routing-table]
            [Neighbor-peers]

Routing-table = Num-entries
              [Peer-Info]*

Neighbor-peers = Num-entries
               [Peer-Info]*



 TOC 

6.2.  Leave

A peer sends a leave request to gracefully inform its neighbors about its departure. The neighbors MUST update their neighbor pointers and take over the peer-IDs and resource-objects the leaving node is responsible for.

A peer SHOULD send the leave message over TCP as the size of peer-IDs and resource-objects is likely to exceed MTU. However, a peer SHOULD avoid a new TCP connection to send the leave message.

A peer receiving a leave request MUST immediately respond with a 200 Ok.

A client MAY also send a leave message to inform its routing peers about its impending departure.

(Open Issue: The user should not have to wait for the p2p application to shut down, which in turn is waiting for the receipt of leave response.)

Leave = Common-header
      Peer-Info
      [Peer-Info]*
      [Resource-object]*
Leave (Resp) = Common-header
             Peer-Info


 TOC 

6.3.  Keep-alive

A peer sends a periodic keep-alive message to its routing and immediate peers. A peer MUST send keep-alive to its neighbor and routing table peers before the expiry of their refresh interval. The two immediate neighbors do not need to send a periodic keep-alive message to each other and can use various heuristics for keep-alive timer duration such as randomly sending a keep-alive within an interval. Also, it is possible that a peer may receive a request from its neighbors before the expiry of the keep-alive timer. The arrival of the request SHOULD be considered as the receipt of a keep-alive message and the expiry timers SHOULD be updated accordingly.

If a neighbor fails, a peer may have to immediately find a new neighbor to ensure lookup correctness using Peer-Lookup request. If a routing entry fails, a node may choose to repair it immediately or defer till a lookup request arrives.

Keep-alive can be considered as a lookup message for an existing entry in the routing table. Keep-alives SHOULD be sent over UDP. (Open Issue: Should Keep-alives be sent when a node has a TCP connection with its neighbor and routing peers?)

Keep-alive = Common-header
           Peer-Info
           Expires

A peer receiving a keep-alive request responds with a 200 Ok response and includes an Expires object. The request originator MUST send another request before the expiration of this time.

Keep-alive (Resp) = Common-header
                  Peer-Info
                  Expires


 TOC 

6.4.  Peer-Lookup

A peer MAY send a Peer lookup message to update its routing or neighbor table entries. For DHTs, a peer lookup is different from an ordinary lookup message because the requesting peer is seeking a peer(s) whose peer-ID may immediately succeed or precede a certain ID or may lie within a range of IDs.

Peer-Lookup = Common-Header
            Peer-Info
            PLookup

A peer receiving an iterative routing-peer lookup request will send a 302 Next Hop response there is another peer with an ID close to the peer-ID being searched for.

Peer-Lookup (resp) = Common-Header
                   Peer-Info
                   [Peer-Info]


 TOC 

6.5.  Update

A peer sends an update request to another peer to request a copy of its neighbor or routing table peers. A peer SHOULD send an update request over TCP since the size of the routing table is likely to exceed the MTU size.

Update = Common-Header
       Peer-Info
Update (resp) = Common-Header
              Peer-Info
              [Routing-table]
              [Neighbor-peers]


 TOC 

6.6.  Query

A peer sends a query request to any peer in the overlay requesting information about the overlay algorithm, the hash algorithm (if any), key size, and overlay-ID. A peer MUST send the query request to an existing peer in the p2p network before sending a join request.

Query = Common-Header
      Peer-Info
Query (resp) = Common-Header
             Peer-Info
             P2P-options

P2P-options = Hash-algorithm
           Hash-algorithm-length
           DHT-algorithm
           OverlayID-length
           Overlay-ID


 TOC 

6.7.  Replicate

A peer can replicate its resource-objects using Replicate request. The choice of nodes on which the resource-objects should be replicated is left to the implementer of the overlay. Heuristics such as replicate to the next k neighbors of this node can be used.

A node may also need to replicate its resource-objects when its neighbors are updated. The replicate request SHOULD be sent over TCP.

Replicate = Common-header
            Peer-Info
            [Peer-Info]*
            [Resource-object]*

Replicate (Resp) = Common-header
                 Peer-Info


 TOC 

6.8.  Insert (put)

A peer sends an insert request to a peer already in the p2p network to insert a resource-object. The insertion operation involves locating the peer responsible for the resource-object and then inserting either the network address of the resource-object publisher or the resource-object itself. The insert operation is different from the join operation in the sense that it does not change the p2p network geometry. The insert operation can also be used to update an existing resource-object.

The publisher of the resource-object MUST also include information about its owner in the insert request because the publisher is not necessarily the owner of the resource-object. It is possible that multiple owners may wish to publish or update an existing resource-object and a peer may want to retrieve part of the resource-object inserted or updated by a certain owner. (Open Issue: What is the format of the owner object?)

The semantics of the resource-object are managed by the application. P2PP only requires each resource to have a resource-ID.

A peer MAY send the insert request over TCP or UDP. A peer may use UDP if the size of the request is less than MTU. The size of the request primarily depends on the size of resource-object. If the size of the request exceeds MTU, a peer MAY NOT include the resource-object in the request. If the size of resource-object is greater than MTU, the request SHOULD be sent over TCP. However, the request may traverse a number of hops and sending the request over TCP may incur a per-hop TCP connection setup latency in the worst case. A solution is to send an insert request with the resource-object TLV only containing the resource-ID, over UDP. When the peer responsible for the resource-object is found, it replies with a 480 Alternative Service response. The querier on receiving this response sends the insert request containing the resource-object over TCP to the peer responsible for the resource-object.

Insert = Common-header
         Peer-Info
         Request-Options
         Expires
         [Owner]
         Resource-object
Insert (Resp) = Common-header
         Response-code
         Expires


 TOC 

6.9.  Lookup (get)

A peer issues a lookup request to retrieve a resource-object from the p2p network. It includes the Resource-object TLV in the request and sets the I flag to zero.

The resource-object size may exceed UDP MTU, if the request was sent over UDP. The node responsible for the resource-object replies with a 480 Alternative Service response. The querier, on receiving this response, reissues the lookup request over TCP to the peer responsible for the resource-object.

Lookup = Common-header
       Peer-Info
       Request-Options
       [Owner]
       Resource-object

If the resource-object is found, the peer responsible for it replies with a 200 Ok response. If a peer receiving a lookup request determines that it cannot forward the request, it generates a 404 Not Found response.

Lookup (resp) = Common-header
              Peer-Info
              Resource-object


 TOC 

6.10.  Remove

A node sends the remove request to remove the resource-object before its lifetime expiration.

Remove = Common-header
       Peer-Info
       [Owner]
       [Resource-object]

Remove (Resp) = Common-header
              Peer-Info


 TOC 

7.  Bit-Level Formats and Errors

This section defines PDUs for the messages and their methods defined above.



 TOC 

7.1.  Common Header

This header begins all P2PP messages. It has a fixed format, as shown below.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| V=3 |T|       Reserved        | Request Type  |      TTL      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Message Length                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Transaction-ID                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Version (3 bits):
The protocol version number. The current version is 1.
T flag:
If set (T=1), the message is a request. Otherwise, it is a response.
Request Type (8 bits):
The request message type such as join and leave.
TTL (8 bits):
A hop count for the number of peers this request can traverse. (Open Issue: Upper limit of 256 maybe too stringent. If TTL becomes zero, a node replies with a 410 TTL hops exceeded response. The request can then be reissued to the node generating 410 response.)
Message Length (32 bits):
The byte length of the message after the common header itself.
Transaction-ID (32 bits):
A unique number to match responses with the originated requests.

For responses, the rightmost 10-bits of the Reserved field contain the response code.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| V=3 |T|r|r|   Response code   | Request Type  |      TTL      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Message Length                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Transaction-ID                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+



 TOC 

7.2.  General Object Format

Each P2PP object begins with a fixed header giving the object type and length. This is followed by the object Value.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|A|B|r|r|  Object-Type  |r|r|r|r|           Length              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                          Value                              //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

AB=00 "(Mandatory)":
If the object is not understood, the entire message containing it MUST be rejected with an "Object Type Error" message with sub code 1 ("Unrecognized Object").
AB=01 "(Ignore)":
If the object is not understood, it MUST be deleted and the rest of the message processed as usual.
Object-Type (8 bits):
An IANA-assigned identifier for the type of the object.
Length (16 bits):
The byte length of the object.

The combination AB=10 and AB=11 are reserved.



 TOC 

7.3.  P2PP TLV Objects



 TOC 

7.3.1.  Peer-Info

Type:
Peer-Info
Length:
Variable (depends on the length of Peer-ID, IP version and unhashed key if any. It can include Uptime, Neighbor/resource-utilization, and NAT and firewall TLV objects.)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                          Peer-ID                            //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                     Additional Information                    :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Peer-ID (variable):
The fixed-length output of the hash function negotiated at join. If the Algorithm-Len field queried at join is zero, a length field of 16 bits MUST be present before Peer-ID. Unhashed-ID TLV object MUST NOT be included.
Additional Fields:
Various additional fields that may be useful for a particular network environment. Their inclusion is left to the overlay implementer.



 TOC 

7.3.1.1.  Additional Information Fields

Unhashed ID:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                        Unhashed-ID                          //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Unhashed-ID (variable):
Unhashed ID of this peer. This is only included in a DHT-based overlay. If the unhashed-ID is the same as IP-address of the peer, it SHOULD NOT be included.

Uptime:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Uptime                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Uptime (32 bits):
The uptime of this peer in number of seconds.

Resource-Info:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Number of Neighbors       |   CPU Util.   |   BW. Util.   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Peer bandwidth                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Number of Neighbors (16 bits):
The number of routing and immediate neighbors of this peer.
CPU Utilization (8 bits):
CPU Utilization of this peer on a scale of 1 to 100.
Bandwidth Utilization (8 bits):
Bandwidth utilization of this peer on a scale of 1 to 100.
Peer Bandwidth (32 bits):
Estimated peer bandwidth in kilo bits per second. This is tricky to determine and is TBD.

Address-Info:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|IP-Ver |  Num  |  HT  |Reserved|           Port                |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                         Peer address                        //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
IP-Ver (4 bits):
The IP version number, 4 or 6.
Num (4 bits):
Number of peer IP address, port, transport and address type 4-tuples gathered using Interactive Connectivity Establishment (ICE) [7] (Rosenberg, J., “Interactive Connectivity Establishment (ICE): A Methodology for Network Address Translator (NAT) Traversal for Offer/Answer Protocols,” October 2006.).
HT (4 bits):
The address type of the peer as defined in ICE [7] (Rosenberg, J., “Interactive Connectivity Establishment (ICE): A Methodology for Network Address Translator (NAT) Traversal for Offer/Answer Protocols,” October 2006.). One of host (0000), server reflexive (0001), peer reflexive (0010), or relayed candidate (0011).
Peer Address (variable):
The IP address of the peer. Its length depends on the IP-Ver field.
Port (16 bits):
The port on which this peer listens for requests.



 TOC 

7.3.2.  Request-Options

Type:
Request-Options
Length:
Fixed (32-bit word)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|I|P|R|N|E|A|S|                                                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

I (1 bit):
If set (I=1), then insert this peer in the overlay at join. A P2PSIP client MUST set this bit to zero.
P (1 bit):
If set (P=1), designate one copy as primary for parallel lookups.
R (1 bit) request-routing-table:
If set (R=1), send a copy of the routing table to the peer issuing the request. The transmission of routing-table copy is governed by the in-separate-request and partial-copy flags.
N (1 bit) request-neighbor-table:
If set (N=1), send a copy of the neighbor table to the peer issuing the request using the mechanism described for routing-table.
E (1 bit) in-separate-request:
If set (E=1), and if P or R are also set, the peer is requesting to receive routing or neighbor table in an Update request. If not set (E=0), and the request was received over TCP, each peer along the request path can add a copy of its routing or neighbor table before forwarding the response.
A (1 bit) partial-reply:
If set (A=1), the peer generating the definite response sends a copy of the routing or neighbor table as determined by the P and N flags in its response as permitted by the UDP MTU. If E (in-separate-request) is also set, the rest of the routing or neighbor table is sent in a separate Update request.
S (1 bit):
If set (S=1) and if I flag is also set, the request is being sent to the immediate neighbors of the newly joining peer.



 TOC 

7.3.3.  P2P-Options

Type:
P2P-Options
Length:
Variable (depends on the length of Overlay-ID)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|r| Algorithm | Algorithm-Len | P2P-Algorithm | OverlayID-Len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                          Overlay-ID                         //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

R flag:
If set (R=1), the peers in the overlay use recursive request forwarding.
Algorithm (6 bits):
An IANA-assigned identifier for the hash algorithm.
Algorithm-Length (8 bits):
The byte length of the hash algorithm. If set to zero, then no hash algorithm is used.
P2P-Algorithm (8 bits):
An IANA-assigned identifier for the P2P algorithm being used.
OverlayID-Length (8 bits):
The byte length of overlay-ID.
Overlay-ID (variable):
Overlay-ID.



 TOC 

7.3.4.  Routing-Table

Type:
Routing-Table
Length:
Variable (depends on the number of Peer-Info objects)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Num entries  |                                               |
+-+-+-+-+-+-+-+-+               [Peer-Info]+                    +
//                                                             //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Num-entries (8 bits):
Number of Peer-Info objects in the routing table.
Peer-Info (variable):
One or more Peer-Info objects.



 TOC 

7.3.5.  Neighbor-table

Type:
Neighbor-Table
Length:
Variable (depends on the number of Peer-Info objects)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Num entries  |                                               |
+-+-+-+-+-+-+-+-+               [Peer-Info]+                    +
//                                                             //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Num-entries (8 bits):
Number of Peer-Info objects in the neighbor table.
Peer-Info (variable):
One or more Peer-Info objects.



 TOC 

7.3.6.  PLookup

Type:
PLookup
Length:
Variable (depends on the length of Peer-ID and whether this is a range lookup)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Num   |r|r|R|       Peer-ID (equal to Algorithm-Len)       //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                             Peer-ID                         //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Num (5 bits):
Number of peers to look for.
R (1 bit):
If set (R=1), then it is a range lookup.
Peer-ID (variable):
Peer-ID. If Algorithm-Len queried at join is zero, then a length field of 16 bits MUST be present before Peer-ID.
Peer-ID (variable):
Peer-ID. Included only if R is set. If Algorithm-Len queried at join is zero, then a length field of 16 bits MUST be present before Peer-ID.



 TOC 

7.3.7.  Resource-Object

Type:
Resource-Object
Length:
Variable (depends on the size of resource-object)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|I| Reserved  |            Resource-ID                         //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                      Resource-object                        //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

I flag:
If set (I=1), the resource-object is included.
Resource-ID (variable):
The ID of the resource object. For DHT-based overlays, its length is equal to the length of the hash algorithm. If Algorithm-Len field negotiated at join is zero, a length field of 16-bits must precede resource-ID.
Resource-object (variable):
Variable length.



 TOC 

7.3.8.  Expires

Type:
Expires
Length:
Fixed (32-bit word)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Expire time in seconds                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Expires (32 bits):
Time in seconds



 TOC 

7.3.9.  Owner

Type:
Owner
Length:
Variable

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                            Owner                            //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Owner (variable):
The owner of the Resource-Object. Format TBD.



 TOC 

7.3.10.  Credentials

TBD. Need an elaborate credentials mechanism.

Type:
Credentials
Length:
Variable

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                         Credentials                         //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Credentials (variable):
The credentials of the peer sending the join request. TBD.



 TOC 

7.4.  Errors



 TOC 

7.4.1.  Error Object

Type:
Error
Length:
Variable

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Link MTU            |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Additional Information                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Link MTU (16 bits):
The MTU for a link along which a message cannot be sent. Other error info TBD.



 TOC 

7.4.2.  Error Codes

There are four different types of error codes. They are:

1xx (Provisional):
Response data which provides an update on the progress of the request. These responses are only sent when request is sent over an unreliable transport in a recursive manner.
2xx (Success):
Response data which indicates that the request has been processed successfully in some sense.
3xx (Redirect):
Response data which indicates that the request should be redirected.
4xx (Request Failure):
Response data which indicates that the request has failed.

2xx-4xx responses are classified as definite responses while 1xx are provisional responses.



 TOC 

7.4.2.1.  1xx (Provisional) Responses

100 Trying.
The request is being tried. This response is only sent if the request was sent in a recursive manner over UDP, and needs to be forwarded to the next peer.



 TOC 

7.4.2.2.  2xx (Successful) Responses

200 Successful.
A successful answer to the request.



 TOC 

7.4.2.3.  3xx (Redirect) Responses

302 Next Hop.
This response is only generated for iterative requests if the peer receiving the request is not the final destination for the request.



 TOC 

7.4.2.4.  4xx (Permanent-Failure) Responses

400 Bad request.
There was an error parsing the request.
404 Not found.
This response is generated for a lookup request if the resource-object being searched for is not found.
405 Error inserting object.
There was an error inserting the resource-object.
406 Request rejected.
The request was understood but rejected by the peer.
407 Join request deferred.
The peer issuing the join request should retry after a certain time.
410 TTL hops
The number of TTL hops exceeded.
413 Message too large.
The response message size was too large. This response is typically generated for unreliable transports.
418 Timeout.
The request timed out. This response is generated when request was forwarded in a recursive manner over UDP and no response was received.



 TOC 

8.  API between P2PP and Applications

An application using peer-to-peer protocol issues non-blocking calls to P2PP layer to accomplish the following operations:



 TOC 

8.1.  Query

Query([out]Overlay-ID, [out]P2P-Algorithm, [out]Hash-Algorithm, [out]Hash-Algorithm-Len, [in]Overlay-peer-network-address, [in]Overlay-peer-port)



 TOC 

8.2.  Join

Join([in]Overlay-ID, [in]Overlay-peer-network-address, [in]Overlay-peer-port)



 TOC 

8.3.  Leave

Leave([in]Overlay-ID)



 TOC 

8.4.  Insert

Insert([in]Resource-ID, [in]Resource-Object)



 TOC 

8.5.  Remove

Remove([in]Resource-ID)



 TOC 

8.6.  Lookup

Lookup([in]Resource-ID, [out]Resource-Object)

(Open Issue: Each operation should have a crypto-signed counterpart.)



 TOC 

9.  Security Considerations

TBD.



 TOC 

10.  Open Issues

  1. Should recursive parallel routing be supported?
  2. Peers can invite other peers in the overlay. Should this mechanism be part of the P2PP messages or should this be a separate mechanism?
  3. How can a node estimate its CPU and bandwidth utilization?
  4. Security
  5. Should trust mechanism be part of the protocol?
  6. Should client Peer-Info objects be inserted in the overlay?
  7. Should diagnostic interface be merged with maintenance interface?



 TOC 

11.  Acknowledgements

The authors will like to thank the following (in alphabetical order) for their helpful comments and suggestions. Christian Dickmann, Jae W. Lee, Kundan Singh, and Eunsoo Shim.



 TOC 

12.  IANA Considerations

Listening Port:
The port on which a peer listens for request. The current implementation uses 7080.

Message Types:
The P2PP common header contains a one byte message type field. A message can either be a request or a response. The following values are allocated by this specification for request messages.

Message TypeMessage
0 Join
1 Leave
2 Keep-alive
3 Peer-Lookup
4 Update
5 Query
6 Replicate
21 Insert
22 Lookup
23 Remove

Object Types:
There is a one byte field in the object header. The following values for object types are defined by this specification.

OTypeObject Type
0 Peer-Info
1 P2P-options
2 Routing-table
3 Neighbor-table
4 PLookup
5 Resource-object
6 Expires
7 Request-Options
8 Owner
9 Credentials
10 Uptime
11 Unhashed-ID
12 Capacity-Info
13 Address-Info
14 Error

Algorithm:
There is a one byte algorithm field in the P2P-options object which defines the hash function being used. The following values are allocated by this specification for this field.

ATypeAlgorithm
0 None
1 SHA1
2 SHA-256
3 SHA-512
4 MD4
5 MD5

Algorithm:
There is a one byte P2P-algorithm field in the P2P-options object which defines the p2p algorithm being used. The following values are allocated by this specification for this field.

PTypeP2P-Algorithm
0 Chord
1 CAN
2 Kademlia
3 Pastry
4 Bamboo
5 Tapestry
6 Accordion
7 SkipNet
8 Mercury
9 Gia


 TOC 

13.  References



 TOC 

13.1. Normative References

[1] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
[2] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Petersen, J., Sparks, R., Handley, M., and E. Schooler, “SIP:Session Initiation Protocol,” RFC 3261, June 2002.
[3] Rosenberg, J. and H. Schulzrinne, “Guidelines for Authors of Extensions to the Session Initiation Protocol (SIP),” RFC 4485, May 2006.
[4] Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications:ABNF,” RFC 4234, October 2005.
[5] Rosenberg, J., Huitema, C., Mahy, R., and D. Wing, “Simple Traversal Underneath Network Address Translators (NAT) (STUN),”  draft-ietf-behave-rfc3489bis-05 (work in progress), October 2006.
[6] Rosenberg, J., Mahy, R., and C. Huitema, “Obtaining Relay Addresses from Simple Traversal Underneath NAT (STUN),”  draft-ietf-behave-turn-02 (work in progress), October 2006.
[7] Rosenberg, J., “Interactive Connectivity Establishment (ICE): A Methodology for Network Address Translator (NAT) Traversal for Offer/Answer Protocols,”  draft-ietf-mmusic-ice-11 (work in progress), October 2006.


 TOC 

13.2. Informative References

[8] Willis, D., Bryan, D., Matthews, P., and E. Shim, “Concepts and Terminology for Peer-to-Peer SIP,” draft-willis-p2psip-concepts-03 (work in progress), October 2006.
[9] Bryan, D. and C. Jennings, “A P2P Approach to SIP Registration and Resource Location,” draft-bryan-sipping-p2p-03 (work in progress), October 2006.
[10] Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., and H. Yu, “OpenDHT: A Public DHT Service and its Uses,” SIGCOMM '05:Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications.  Philadelphia, PA, pp. 73-84, 2005.
[11] Pugh, W., “Skip Lists: A Probabilistic Alternative to Balanced Trees,” Workshop on Algorithms and Data Structures.  pp. 437-449, 1989.
[12] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and H. Balakrishnan, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications,” IEEE/ACM Transactions on Networking, vol. 11, no. 1, pp. 17-32, 2003.
[13] Ratmasamy, S., Francis, P., Handley, M., Karp, R., and S. Shenker, “A Scalable Content-Addressable Network,” SIGCOMM '01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications.  San Diego, CA, pp. 161-172, August 2001.
[14] Rowstron, A. and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001), March 2002.
[15] Maymounkov, P. and D. Mazieres, “Kademlia: A Peer-to-Peer Information System Based on the XOR Metric,” IPTPS'01: Revised Papers from the First International Workshop on Peer-to-Peer Systems London, UK, pp. 53-65, March 2002.
[16] Karger, D., Lehman, E., Leighton, T., Panigraphy, R., Levine, F., and D. Lewin, “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web,” STOC '97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing , 1997.
[17] Balakrishnan, H., Kaashoek, F., Karger, D., Morris, R., and I. Stoica, “Looking up data in P2P systems,” Communications of the ACM, vol. 46, no. 2, pp. 43-48, 2003.
[18] Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., and I. Stoica, “Towards a Common API for Structured Peer-to-Peer Overlays,” IPTPS'03: Proceedings of the 2nd International Workshop on Peer-to-Peer Systems.  Berkeley, California, February 2003.
[19] Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., Kaashoek, M., and S. Shenker, “Making gnutella-like p2p systems scalable,” SIGCOMM '03: Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA, pp. 407-418, 2003.
[20] Li, J., Stribling, J., Morris, R., and M. Kaashoek, “Bandwidth-efficient management of DHT routing tables,” in Proceedings of the 2nd USENIX Symposium on Networked Systems Design and Implementation (NSDI '05).  Boston, MA, USA, 2005.


 TOC 

Appendix A.  Background

This section gives a background of various structured and unstructured peer-to-peer algorithms.



 TOC 

A.1.  Structured Overlays

A structured peer-to-peer network is formed when links between nodes follow a certain pattern or structure. Distributed hash tables (DHTs) [12] (Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and H. Balakrishnan, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications,” .)[13] (Ratmasamy, S., Francis, P., Handley, M., Karp, R., and S. Shenker, “A Scalable Content-Addressable Network,” August 2001.)[14] (Rowstron, A. and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” March 2002.)[15] (Maymounkov, P. and D. Mazieres, “Kademlia: A Peer-to-Peer Information System Based on the XOR Metric,” March 2002.) are examples of structured peer-to-peer networks. DHTs share a fundamental function, namely routing a message to a node responsible for an identifier (key) in O(log_{b}N) steps using a certain routing metric where N is the number of nodes in the system and b is the base of the logarithm.

In this section, we give an overview of Chord [12] (Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and H. Balakrishnan, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications,” .), Pastry [14] (Rowstron, A. and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” March 2002.) and Kademlia [15] (Maymounkov, P. and D. Mazieres, “Kademlia: A Peer-to-Peer Information System Based on the XOR Metric,” March 2002.). These algorithms are based on the idea of consistent hashing [16] (Karger, D., Lehman, E., Leighton, T., Panigraphy, R., Levine, F., and D. Lewin, “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web,” 1997.) i.e., keys are mapped onto nodes by a hash function that can be resolved by any node in the system via queries to other nodes and the arrival or departure of a node does not require all keys to be rehashed. We compare their algorithm-specific and algorithm-independent details in Table 1 and 2 and describe the commonalities that exist between them.



 TOC 

A.1.1.  Chord

The identifiers or keys in Chord can be logically considered to be arranged on a circle. Each node in Chord maintains two data structures, a 'successor list' (neighbor table) which is the list of peers immediately succeeding the node ID and a 'finger table' (routing table). The finger table contains the peer-ID and IP address of nodes halfway around the ID space from the node, a quarter-of-the-way, an eighth-of-the-way and so forth in a data structure that resembles a skiplist [11] (Pugh, W., “Skip Lists: A Probabilistic Alternative to Balanced Trees,” 1989.). A node forwards a query for a key k to a node in its finger (routing) table with the highest ID not exceeding k. The skiplist structure ensures that a key can be found in O(log_{2}N) steps where N is the number of nodes in the system.

To join a Chord ring, a node contacts any peer in the Chord network and requests it to lookup its ID. It then inserts itself at the appropriate position in the Chord network. The predecessors of the newly joined node must update their successor lists. The newly joined node should also update its finger table. Successor list is the only requirement for correctness while finger table is used to speedup the lookups.

To guard against node failures, Chord sends keep-alives to its successors and finger table entries and continuously repairs them. The routing table size is log_{2}N.

Chord suggests two ways for key and data replication. In the first method, an application replicates data by storing it under two different Chord keys derived from the data's key. Alternatively, a Chord node can replicate key/value pair on each of its r successors.



 TOC 

A.1.2.  Pastry

Like Chord, the identifiers or keys in Pastry can be logically considered to be arranged on a circle; however, the routing is done in a tree-based (prefix-matching) fashion. Each node in Pastry contains two data structures, a 'leaf-set' and a 'routing table'. The leaf-set L (neighbor table) contains |L|/2 closest nodes with numerically smaller identifiers than the node n and |L|/2 closest nodes with numerically larger identifiers than n and is conceptually similar to Chord successor list [17] (Balakrishnan, H., Kaashoek, F., Karger, D., Morris, R., and I. Stoica, “Looking up data in P2P systems,” 2003.). The routing table contains the IP address of nodes with no prefix match, b bits prefix match, 2b prefix match and so on where b is typically 2, 4, 6, 8 etc. The maximum size of routing table is log_{2^b}N x 2^b. At each step, a node n tries to route the message to a node that has a longest sharing prefix than the node n with the sought key. If there is no such node, the node n routes the message to a node whose shared prefix is at least as long as n and whose ID is numerically closer to the key. The expected number of hops is at most log_{2^b}N.

To join the Pastry network, a node contacts any node in the Pastry network and builds routing tables and leaf sets by obtaining information from the nodes along the path from bootstrapping node and the node closest in ID space to itself. When a node gracefully leaves the network, the leaf-sets of its neighbors are immediately updated. The routing table information is corrected only on demand.

The routing table of a Pastry node is initialized such that each entry i with a common prefix p_{i} is closer to the node (in the network sense) among all other live nodes having a prefix p_{i}. This technique is commonly known as proximity neighbor selection (PNS). Pastry performs recursive lookups. However, PNS and recursive lookups are orthogonal to the Pastry operation.

Pastry replicates data by storing the key/value pair on k nodes with the numerically closest node IDs to a key [14] (Rowstron, A. and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” March 2002.). This method is conceptually similar to Chord's replication of key/value pairs on its successor list.



 TOC 

A.1.3.  Kademlia

Like Chord and Pastry, identifiers in Kademlia can be logically considered to be arranged on a circle; however the routing is done in a tree-based (prefix-matching) fashion. Each node in Kademlia contains a routing table. Kademlia contains only one data structure i.e. the routing table. Unlike Chord and Pastry, there are no successor lists or leaf sets although it is possible for a node to maintain one.

Kademlia uses XOR metric to compute the distance between two identifiers, i.e., d(x,y)=x XOR y. XOR metric is non-Euclidean and it offers the triangle property: d(x,y)+d(y,z) >= d(x,z). Essentially, XOR metric is a prefix matching algorithm which tries to route a message to a node with the longest matching prefix and the smallest XOR value for non-prefix bits.

Kademlia maintains up to k entries per routing table row and allows parallel lookups to all nodes in a row. However, this is not really a Kademlia specific feature and other DHT algorithms can implement it by maintaining multiple entries per routing table row. The latest incarnation of Chord contains more than one finger entry per row [20] (Li, J., Stribling, J., Morris, R., and M. Kaashoek, “Bandwidth-efficient management of DHT routing tables,” 2005.).

The routing table size is log_{2}N. The lookup speed can be increased by considering IDs b bits at a time instead of one bit at a time which implies increasing the routing table size. By increasing the routing table size to 2^b x log_{2^b}N x k entries, the number of lookup hops can be reduced to log_{2^b}N.

Kademlia replicates data by finding k closest nodes to a key and storing the key/value pair on them. The Kademlia paper suggests a value of 20 for k.



 TOC 

A.1.4.  Bamboo/OpenDHT

Similar to Pastry. Detailed description ToDo.

          Key       Recursive/  Sequential/ Routing   Neighbor
          length    Iterative   Parallel    table     nodes
                                            name
---------------------------------------------------------------
Chord        160    Both        Sequential  Finger   Successor
                                            table    list
Pastry       128    Recursive   Sequential  Routing  Leaf-set
                                            table
Kademlia     160    Iterative   Parallel    Routing  None
                                            table

Table 1. DHT-independent details of Chord, Pastry and Kademlia.

          Routing    Routing       Symmetric  Learning
          data       table node
          structure  selection
-------------------------------------------------------------
Chord     Skip-list  Immediately   No         No (orig. paper)
                     succeed the              but possible.
                     interval
                     (can be a
                     range)
Pastry    Tree-like  Any node in   Yes        Yes
                     the interval
Kademlia  Tree-like  Any node in   Yes        Yes
                     the interval

Table 2. Algorithm specific details of Chord, Pastry and Kademlia.



 TOC 

A.2.  DHT Commonalities

Table 1 and 2 list the DHT-independent and DHT-specific aspects of Chord, Pastry and Kademlia. From the above discussion, and from these tables, we can think of following commonalities between Chord, Pastry and Kademlia.



 TOC 

A.3.  Unstructured Overlays

An unstructured peer-to-peer network is formed when links between peers are formed arbitrarily. Unlike structured peer-to-peer systems, they do not necessarily provide a routing guarantee of O(log_{b}N) hops nor they guarantee finding an identifier if it exists.



 TOC 

A.3.1.  Gia

In this section, we give a brief description of Gia [19] (Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., Kaashoek, M., and S. Shenker, “Making gnutella-like p2p systems scalable,” 2003.), an unstructured peer-to-peer protocol. Gia extends the Gnutella topology and search algorithms in order to accommodate natural heterogeniety present in the peer-to-peer systems. Like other unstructured protocols, it does not guarantee finding an identifier if it exists. Unlike Chord and Pastry, a Gia node does not have a separate routing and neighbor table. Each Gia node maintains a neighbor table (host cache) comprising a list of other Gia nodes (their IP address, port number, and capacity). Nodes exchange capacity and congestion information with each other. A Gia node uses random walks to locate a resource and biases random walks towards high capacity nodes.

A node Y attempting to join Gia overlay discovers a node already in the overlay using some bootstrap mechanism and sends it a join request. The node receiving the join request selects a small number of high capacity candidate entries in its neighbor table and forwards the request towards the highest capacity node. Each node receiving the join request can add Y as its neighbor if the number of its neighbors is less than some configured maximum. Otherwise, it drops a neighbor with the highest capacity exceeding Y's capacity. The rationale behind dropping the highest capacity node is that it is least likely to suffer from the loss of this neighbor.

In addition to capacity, a Gia nodes takes neighbor nodes' current congestion state into account before forwarding the request. Nodes periodically exchange their congestion state information with each other. A node can be congested if many peers are accessing a resource or service advertised by this peer.

A node forwards the lookup request to the least congested and highest capacity node. Each node remembers the ID of the lookup request it received. The lookup requests are TTL limited.

The pointer to the resource published by a node are replicated on all of its immediate neighbors.



 TOC 

Appendix B.  Examples



 TOC 

B.1.  Join (Pastry)

Consider a Pastry-based overlay below in which node P9 is attempting to join. It will be inserted between nodes P8 and P11. Recall that in Pastry, requests are forwarded in a recursive fashion. Using some bootstrap mechanism outside the scope of this specification, it discovers node P6 which is already in the overlay. It sends a Query request to node P6 inquiring about the overlay parameters such as p2p-algorithm and hash-algorithm. It then sends a join request over UDP to P6. Node P6 forwards the join request in a recursive manner till it reaches node P8. P8 decides that P9 will be inserted immediately after itself in the overlay. It replies with a copy of its neighbor and routing tables in a 200 Ok response. Node P9 eventually receives the response and checks the received neighbor table which contains a pointer to node P11. The nodes in the neighbor table will be the neighbors of P9. Node P9 then sends a Join request to node P11 with S flag set to P11. Node P11 receives the join request with S flag set and knows that the join request should not be forwarded and that a neighbor update has been requested. It then update its left neighbor as P9 instead of P8.

Note that neighbors P8 and P11 do not immediately remove overlay links to each other. Rather, they wait for a random time and exchange their neighbor tables with each other using the Update request. Once both P8 and P11 know that their right and left neighbor links, respectively, have been updated, they choose to remove the link to each other.

+--+        +--+        +--+               +---+       +---+       +---+
|P9|        |P6|        |P8|               |P11|       |P15|       |P20|
+--+        +--+        +--+               +---+       +---+       +---+
  (1) Query
 |---------->|
 |(2) 200    |
 |<----------|
             |           |                   |
 |(3) Join   |
 |---------->|           |                   |
 |           |(4) Join   |                   |
 |           |---------->|                   |
 |           |(5) 200    |                   |
 |           |<----------|                   |
 |(6) 200    |  N(P11)   |                   |
 |<----------| R(P15,P20)                    |
 | N(P11)                                    |
 | R(P15,P20)                                |
 |                                           |
 |(7)             Join(S=1)                  |
 |------------------------------------------>|
 |(8)         200 N(P8,P15) R(P20, P25)      |
 |<------------------------------------------|

 |(9) Update  |
 | R(P11,P20) |
 |<-----------|
 |(10) 200    |
 |----------->|
                       |(11) Update          |
                       |<--------------------|
                       |     N(P9,P15)
                       |(12) 200             |
                       |     N(P6,P9)
                       |-------------------->|

(1) P9 -> P6 (Query)
P2PP(Header(Msg-type=Request; Request-type=Query; TTL=1;
            Trans-Id=0xFDE89F20) )

(2) P6 -> P9 (200 Ok)
P2PP(Header(Msg-type=Response; Response-code=200; Request-type=Query;
            TTL=1; Trans-Id=0xFDE89F20)
     Peer-Info(Peer-ID=P6#ID; Uptime=0x000000D0F9; IP-Ver=4; Num=1;
               HT=host; Port=7080; Peer-address=P6#IP)
     P2P-Options(Recursive-routing=1; Algorithm=SHA1;
                 Algorithm-Length=160; P2P-Algorithm=Pastry;
                 Overlay-ID="Pastry-overlay")
    )

(3) P9 -> P6 (Join) (4) P6 -> P8 (Join) (TTL=159)
P2PP(Header(Msg-type=Request; Request-type=Join; TTL=160;
            Trans-Id=0xD59A3BA1)
     Request-Options(Request-forwarding(F)=Recursive; Insert-on-join=1;
                     P=0; Request-routing-table=1;
                     Request-neighbor-table=0; S=0)
     Peer-Info(Peer-ID=P9#ID; Uptime=0x00000000; IP-Ver=4; Num=1;
              HT=host; Port=7080; Peer-address=P9#IP)
     Overlay-ID=("Pastry-overlay")
    )

(5) P8 -> P6 (200 Ok) (6) P8 -> P9 (200 Ok)
P2PP(Header(Msg-type=Response; Response-code=200; Request-type=Join;
            TTL=159; Trans-Id=0xD59A3BA1)
     Peer-Info(Peer-ID=P8#ID; Uptime=0x00000FD21; IP-Ver=4; Num=1;
               HT=host; Port=7080; Peer-address=P8#IP)
     Neighbor-table(Num=1; Peer-Info(Peer-ID=P11#ID; Uptime=0x00102A5C;
                    IP-Ver=4; Num=1; HT=host; Port=7080;
                    Peer-address=P11#IP))
     Routing-table(Num=2; Peer-Info(PeerID=P15#ID; Uptime=0x000000AB;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P15#IP)
                   Peer-Info(PeerID=P20#ID; Uptime=0x00000EC4;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P20#IP)
    )

(7) P9 -> P11 (Join)
P2PP(Header(Msg-type=Request; Request-type=Join; TTL=1;
                     Trans-Id=0xFA4A3C45)
     Request-Options(Request-forwarding(F)=Recursive; Insert-on-join=1;
                     P=0; Request-routing-table=1;
                     Request-neighbor-table=0; S=1)
     Peer-Info(Peer-ID=P9#ID; Uptime=0x00000000; IP-Ver=4; Num=1;
               HT=host; Port=7080; Peer-address=P9#IP)
     Overlay-ID=("Pastry-overlay"))


(8) 200 Ok
P2PP(Header(Msg-type=Response; Response-code=200; Request-type=Join;
            TTL=0; Trans-Id=0xFA4A3C45)
     Peer-Info(Peer-ID=P11#ID; Uptime=0x00102A5C; IP-Ver=4; Num=1;
               HT=host; Port=7080; Peer-address=P11#IP))
     Neighbor-table(Num=2; Peer-Info(Peer-ID=P8#ID; Uptime=0x00000FD21;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P8#IP)
                   Peer-Info(PeerID=P15#ID; Uptime=0x000000AB;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P15#IP))
     Routing-table(Num=2; Peer-Info(PeerID=P20#ID; Uptime=0x00000EC4;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P20#IP)
                   Peer-Info(PeerID=P25#ID; Uptime=0x0000015A;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P25#IP))
    )

(9) P6 -> P9 (Update)
P2PP(Header(Msg-type=Request; Request-type=Update; TTL=1;
            Trans-Id=0x0183FA78)
     Peer-Info(Peer-ID=P6#ID; Uptime=0x000000D0FA; IP-Ver=4;
               Num=1; HT=host; Port=7080; Peer-address=P6#IP)
     Routing-table(Num=2; Peer-Info(Peer-ID=P11#ID; Uptime=0x00102A5E;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P11#IP)
                   Peer-Info(PeerID=P20#ID; Uptime=0x00000EC7;
                   IP-Ver=4; Num=1; HT=host; Port=7080;
                   Peer-address=P20#IP)
    )

(10) P9 -> P6 (200 Ok)
P2PP(Header(Msg-type=Request; Response-code=200; Request-type=Update;
            TTL=1; Trans-Id=0x0183FA78)
     Peer-Info(Peer-ID=P9#ID; Uptime=0x00000003; IP-Ver=4; Num=1;
              HT=host; Port=7080; Peer-address=P9#IP)
    )

(11) P11 -> P8 (Update)
P2PP(Header(Msg-type=Request; Request-type=Update; TTL=1;
                     Trans-Id=0x2154A526)
     Request-Options(Request-forwarding(F)=Recursive; Insert-on-join=0;
                     P=0; Request-routing-table=0;
                     Request-neighbor-table=1; S=1)
     Peer-Info(Peer-ID=P11#ID; Uptime=0x00102A60; IP-Ver=4; Num=1;
               HT=host; Port=7080; Peer-address=P11#IP))
     Neighbor-table(Num=2; Peer-Info(Peer-ID=P9#ID; Uptime=0x00000001;
                    IP-Ver=4; Num=1; HT=host; Port=7080;
                    Peer-address=P9#IP)
                    Peer-Info(PeerID=P15#ID; Uptime=0x000000AB;
                    IP-Ver=4; Num=1; HT=host; Port=7080;
                    Peer-address=P15#IP)
    )

(12) P8 -> P11 (200)
P2PP(Header(Msg-type=Request; Response-code=200; Request-type=Update;
            TTL=0; Trans-Id=0x2154A526)
     Peer-Info(Peer-ID=P8#ID; Uptime=0x00000FD24; IP-Ver=4; Num=1;
               HT=host; Port=7080; Peer-address=P8#IP)
     Neighbor-table(Num=2; Peer-Info(Peer-ID=P6#ID; Uptime=0x000000D0F9;
                    IP-Ver=4; Num=1; HT=host; Port=7080;
                    Peer-address=P6#IP)
                    Peer-Info(Peer-ID=P9#ID; Uptime=0x00000001;
                    IP-Ver=4; Num=1; HT=host; Port=7080;
                    Peer-address=P9#IP)
    )



 TOC 

B.2.  Join (Gia)

+--+        +--+        +--+               +---+       +---+       +---+
|P9|        |P6|        |P8|               |P11|       |P15|       |P20|
+--+        +--+        +--+               +---+       +---+       +---+
  (1) Query
 |---------->|
 |(2) 200    |
 |<----------|
             |           |                   |
 |(3) Join   |
 |---------->|           |                   |
 |           |(4) Join   |                   |
 |           |---------->|                   |
 |           |(5) 200    |                   |
 |           |<----------|                   |
 |(6) 200    |N(P11,P20) |                   |
 |<----------|                               |
 | N(P11,P20)                                |
 |                                           |
 |                                           |
 |(7)             Join(S=1)                  |
 |------------------------------------------>|
 |(8)         200 N(P8,P15)                  |
 |<------------------------------------------|

 |(9) Update  |
 | N(P11,P20) |
 |<-----------|
 |(10) 200    |
 |----------->|
                       |(11) Update          |
                       |<--------------------|
                       |     N(P9,P15)
                       |(12) 200             |
                       |     N(P6,P9)
                       |-------------------->|




 TOC 

B.3.  Join (Chord)

TBD.



 TOC 

B.4.  Lookup (Pastry)

TBD.



 TOC 

B.5.  Lookup (Gia)

TBD.



 TOC 

B.6.  Lookup (Chord)

TBD.



 TOC 

Authors' Addresses

  Salman A. Baset
  Dept. of Computer Science
  Columbia University
  1214 Amsterdam Avenue
  New York, NY 10027
  USA
Email:  salman@cs.columbia.edu
  
  Henning Schulzrinne
  Dept. of Computer Science
  Columbia University
  1214 Amsterdam Avenue
  New York, NY 10027
  USA
Email:  hgs@cs.columbia.edu


 TOC 

Full Copyright Statement

Intellectual Property

Acknowledgment