Network Working Group                              Mike Kraley (Harvard)
Request for Comments #57                          John Newkirk (Harvard)

June 19, 1970

Thoughts and Reflections on NWG/RFC #54

In the course of writing NWG/RFC #54 several new ideas became

apparent. Since these ideas had not previously been discussed by the

NWG, or were sufficiently imprecise, it was decided not to include them

in the official protocol proffering. We thought, however, that they

might be proper subjects for discussion and later inclusion in the

second level protocol.

I. Errors and Overflow

In line with the discussion in NWG/RFC #48, we felt that two

types of errors should be distinguished. One is a real error, such as

an RFC composed of two send sockets. This type of error can only be

generated by a broken NCP. In the absence of hardware and software

bugs, these events should never occur; the correct response upon

detection of such an event was outlined in the description of the ERR

command in NWG/RFC #54.

The other "error" is an overflow condition arising because

finite system resources are exhausted. An overflow condition could

occur if an RFC was received, but there was no room to create the

requisite tables and queues. This is not a real error, in the sense

that no one has done anything incorrect (except perhaps the system

planners in not providing sufficient table space, etc.). Further, a

recovery procedure can be well defined, and simply entails repeating the

request at a future time. Thus, we believe an overflow condition should

be distinguished from a real error.

In NWG/RFC #54 an overflow condition was reported by returning

a CLS, as if the connection had been refused. This sequence performs

the necessary functions, and leaves the connection in the correct state,

but the initiating user is misinformed. He is deluded into thinking

that he was refused by the foreign process, when, in fact, this was not

the case. In certain algorithms this difference is crucial.
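
As an illustration of why the difference matters, the following is a minimal sketch of a retry loop, assuming the initiating NCP can tell a refusal from an overflow. The names Outcome, connect_with_retry, and request are hypothetical and appear nowhere in NWG/RFC #54; the sketch only shows that repeating the request makes sense after an overflow and not after a refusal.

```python
import time
from enum import Enum


class Outcome(Enum):
    """Hypothetical results of one connection request."""
    ACCEPTED = "accepted"
    REFUSED = "refused"    # the foreign process declined the connection
    OVERFLOW = "overflow"  # the foreign host had no room for tables and queues


def connect_with_retry(request, attempts=3, delay=0.0, sleep=time.sleep):
    """Call request() until it is accepted or refused, or attempts run out.

    request is an assumed callable returning an Outcome. Only an
    overflow is worth repeating at a future time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    outcome = Outcome.OVERFLOW
    for attempt in range(attempts):
        outcome = request()
        if outcome is not Outcome.OVERFLOW:
            return outcome  # accepted or refused: retrying would not help
        if attempt < attempts - 1:
            sleep(delay)    # wait before repeating the request
    return outcome
```

If overflow were reported as a CLS, the loop above could not be written: every overflow would look like a refusal and end after the first attempt.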

In further defining error conditions, we felt that it would

be helpful to specify why the error was detected, in addition to

specifying what caused the error. While writing the pseudo-Algol

program mentioned in NWG/RFC #55 we differentiated 9 types of errors

(listed below). We would, therefore, like to propose the extension of

the ERR message to include an 8-bit field following the op code to

designate the type of error. This would be followed by the length and

text fields, as before. We propose these error types:

0. UNSPECIFIED ERROR

1. HOMOSEX (invalid send/rcv pair in an RFC)

2. ILLEGAL OP CODE

3. ILLEGAL LEADER (bad message type, etc.)

4. ILLEGAL COMMAND SEQUENCE

5. ILLEGAL SOCKET SPECIFICATION - COMMAND

6. ILLEGAL COMMAND LENGTH (last command in message was too short)

7. CONNECTION NOT OPEN - DATA

8. DATA OVERFLOW (message longer than advertised available
   buffer space)

9. ILLEGAL SOCKET SPECIFICATION - DATA (socket does not exist)
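
The extended ERR layout (op code, 8-bit error type, length, text) might be sketched as follows. This is an illustration under stated assumptions, not a definition: the op code value, the 8-bit width of the op code and of the length field, and the names ErrType, encode_err, and decode_err are all hypothetical. Only the ten type numbers come from the list above.

```python
from enum import IntEnum
from typing import Tuple

ERR_OPCODE = 0x09  # hypothetical value; no op code assignment is given here


class ErrType(IntEnum):
    """Error types 0-9 as proposed in the list above."""
    UNSPECIFIED = 0
    HOMOSEX = 1                   # invalid send/rcv pair in an RFC
    ILLEGAL_OP_CODE = 2
    ILLEGAL_LEADER = 3            # bad message type, etc.
    ILLEGAL_COMMAND_SEQUENCE = 4
    ILLEGAL_SOCKET_COMMAND = 5
    ILLEGAL_COMMAND_LENGTH = 6    # last command in message was too short
    CONNECTION_NOT_OPEN_DATA = 7
    DATA_OVERFLOW = 8             # message longer than advertised buffer space
    ILLEGAL_SOCKET_DATA = 9       # socket does not exist


def encode_err(err_type: ErrType, text: bytes) -> bytes:
    """Build op code, error type, length, text (8-bit length is assumed)."""
    if len(text) > 255:
        raise ValueError("text too long for the assumed 8-bit length field")
    return bytes([ERR_OPCODE, int(err_type), len(text)]) + text


def decode_err(message: bytes) -> Tuple[ErrType, bytes]:
    """Parse an ERR command; unknown type codes map to UNSPECIFIED."""
    if len(message) < 3 or message[0] != ERR_OPCODE:
        raise ValueError("not an ERR command")
    length = message[2]
    text = message[3:3 + length]
    if len(text) != length:
        raise ValueError("truncated ERR text")
    try:
        err_type = ErrType(message[1])
    except ValueError:
        err_type = ErrType.UNSPECIFIED
    return err_type, text
```

Mapping unrecognized codes to type 0 is a design choice of this sketch, so that a receiver built against an older list still accepts the message.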

In light of the other considerations mentioned earlier, we

would also like to propose an additional control command to signify

overflow:

        +-------------+-------------------+---------------------+
        |     OVF     |     my socket     |     your socket     |
        +-------------+-------------------+---------------------+

The format of the message is similar to that of the CLS message, which

it replaces in this context. The socket numbers are 32 bits long and

correspond to the socket numbers in the RFC which is being rejected.

The semantics of an incoming OVF should be identical to an incoming

CLS; in addition, the user should be informed that he has not been

refused but rather has overtaxed the foreign host's resources.
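
A minimal sketch of the OVF format and its CLS-like handling follows. The 32-bit socket fields come from the text above; the 8-bit op code width, both op code values, the dictionary used as a connection table, and the names encode_ovf, handle_incoming, and notify are assumptions made for illustration. Closing is modelled simply as deleting the table entry.

```python
import struct

OVF_OPCODE = 0x0A  # hypothetical value
CLS_OPCODE = 0x03  # hypothetical value

# Assumed layout: 8-bit op code, then two 32-bit socket numbers, big-endian.
_FORMAT = ">BII"


def encode_ovf(my_socket, your_socket):
    """Build an OVF rejecting the RFC that named these two sockets."""
    return struct.pack(_FORMAT, OVF_OPCODE, my_socket, your_socket)


def handle_incoming(message, connections, notify):
    """Treat OVF exactly like CLS, but tell the user which one it was.

    connections maps (local_socket, foreign_socket) to some state;
    notify(local_socket, reason) stands in for informing the user.
    """
    # The sender's "my socket" is foreign to us; its "your socket" is ours.
    opcode, foreign_socket, local_socket = struct.unpack(_FORMAT, message)
    if opcode not in (CLS_OPCODE, OVF_OPCODE):
        raise ValueError("not a CLS or OVF command")
    connections.pop((local_socket, foreign_socket), None)
    reason = "overflow" if opcode == OVF_OPCODE else "closed"
    notify(local_socket, reason)
    return reason
```

The table manipulation is the same for both commands; only the reason passed to the user differs.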

An alternative to creating a separate control command can be

realized by considering the similarity between a CLS and an OVF.

Conceivably, an eight-bit field could be added to the CLS command to

define its derivation. We believe, however, that this alternative is

conceptually inferior and practically more difficult to implement.

Overflow does not require serious consideration if it is a

sufficiently rare occurrence. We do not believe this will be the case,

and we further believe that the absence of an overflow mechanism will

be an unnecessary restriction upon the user.

II. Host Up and Host Down

Significant problems can arise when a host goes down and then

attempts to restart. Two cases can easily be distinguished. The first

is a "soft" crash, where the system has prior notice that the machine is

going down; sufficient time is available to execute pre-recovery

procedures. The other case can be termed a "hard" crash, often the

result of a system failure. Little or no warning is usually given; but

more important, the state of the machine after recovery is rarely

predictable.

When a host returns from a hard crash, the network will be

in an undefined state. Very probably the NCP's data structures are

destroyed or are meaningless. The network has declared the host dead --

but only to processes which attempted data transmission and were

refused. The only alternative for the crashed host is re-initialization

of its tables. What are the alternatives for the foreign hosts?

We would like to propose the addition of two control commands:

RESET (RST) and RESET REPLY (RSR). Each would consist solely of an op

code with no parameters. Upon receipt of an RST, a host would

immediately terminate all connections with the sending host, but would

not issue any CLS's. The receiver of the RST would also note that the

originator of the RST was alive, and would then echo an RSR to the

sender. When a host receives an RSR, he should then note that the

echoing host is alive. (The function of RST can be partially simulated

if a host will immediately close all relevant table entries upon

discovering that another host is down.)

Thus, after a hard crash, all connections and requests for

connections are terminated. The RST also informs all foreign hosts that

we are again alive, and an RSR is received from every functioning NCP.

A host live table (see NWG/RFC #55) can easily be

assembled, and establishment of connections can resume.
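
The RST/RSR exchange described above can be sketched as a small model. It is illustrative only: the class Ncp, its send callable, the key layout of the connection table, and the symbolic command names are assumptions, since no op code values are assigned here.

```python
RST = "RST"  # symbolic stand-ins; actual op code values are not assigned here
RSR = "RSR"


class Ncp:
    """Toy model of one host's NCP state for the reset exchange."""

    def __init__(self, host, send):
        self.host = host
        self.send = send        # assumed callable(destination_host, command)
        # (local_socket, foreign_host, foreign_socket) -> connection state
        self.connections = {}
        self.live_hosts = set()  # the host live table

    def restart(self, known_hosts):
        """Re-initialize tables after a hard crash and announce ourselves."""
        self.connections.clear()
        self.live_hosts.clear()
        for other in known_hosts:
            if other != self.host:
                self.send(other, RST)

    def receive(self, sender, command):
        if command == RST:
            # Terminate every connection with the sender; issue no CLS.
            self.connections = {
                key: state
                for key, state in self.connections.items()
                if key[1] != sender
            }
            self.live_hosts.add(sender)  # the originator is alive
            self.send(sender, RSR)       # echo the reply
        elif command == RSR:
            self.live_hosts.add(sender)  # the echoing host is alive
```

After restart(), each RSR that arrives adds one entry to live_hosts, which is how the host live table is assembled in this model.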

Related problems also crop up when we consider attempting

to synchronize the network, which may still be carrying messages

generated prior to the crash, with an NCP which has an initialized

environment. We lack the facilities for unblocking links, discarding

messages, etc. -- facilities which this proposal will necessitate.

Further interaction with BBN should resolve these difficulties.

The problems associated with "soft" crashes are not nearly

as pressing, and they demand more sophisticated (i.e., complex)

solutions. Our preliminary experimentation with the network

demonstrates that a good initialization and recovery protocol is far

more necessary.

Many of the ideas presented herein were germinated and/or

jelled through conversations with Steve Crocker and Jon Postel. We

would also like to acknowledge the assistance of Jim Balter and Charles

Kline of UCLA, who devoted a great deal of effort toward helping develop

the pseudo-Algol program which was the predecessor of much of our recent

documentation.

[ This RFC was put into machine readable form for entry ]
[ into the online RFC archives by Katsunori Tanaka 2/98 ]
