I've got a tricky problem.
We have some systems running in a process network at customer sites. These systems exchange data using TCP/IP. One part of the system is a computer running Windows which receives data from custom-built boxes via Ethernet. The problem is that the data transfer from the boxes to the PC has stopped a few times at different customer sites. The problem is not reproducible, and we don't know whether the boxes or the PC dropped the connection, or whether it happened at the TCP/IP layer, the Ethernet layer, or somewhere else.
The system components are connected via a switch, so people think it would be nice to replace the current switch with one that has logging capabilities. Are there any switches that can report whether an error occurred at the physical layer or the transport layer and give us enough information to debug the problem? I'm thinking of something like tcpdump output, but I thought this would put too much load on the switch, since it has to run all the time because we don't know when the error will happen again. So the switch would need enough space to store the log for, say, a few months. Would SNMP help for this purpose? Also note that we cannot run a packet sniffer on the Windows PC.
Any help, ideas and thoughts would be appreciated.
To really get to the bottom of this, I think you will need a packet capture showing what happens (or doesn't happen) when the problem manifests. Since you say you cannot run a packet sniffer on the PC, the way to go might be to use a switch that supports port mirroring (called SPAN on Cisco devices). Define a mirror port on the switch that collects all the traffic from the other ports and place a PC there running something like Ethereal. You can set Ethereal up to use a circular buffer of several files, so that it dumps to file 1, then switches to file 2 and so on up to, say, file 6, then overwrites file 1 again. That way you have a continuous trace going. All you have to do then is tell the users to hit Stop when the problem occurs, and the trace will be in your files.
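As a rough sketch of the circular-buffer capture described above: the snippet below only builds a command line for dumpcap, the capture tool that ships with Wireshark (the successor to Ethereal), using its ring-buffer options `-b filesize:` and `-b files:`. The function name, the interface value, the file name and the default sizes are my own placeholders, not anything from your setup, so adjust them to taste.

```python
def ring_capture_command(interface, out_file, files=6, filesize_kb=51200):
    """Build a dumpcap command line for a ring-buffer capture.

    interface   -- capture interface name or index (placeholder value)
    out_file    -- base file name; dumpcap adds a counter and timestamp
    files       -- number of files in the ring before the oldest is overwritten
    filesize_kb -- size at which dumpcap switches to the next file, in kB
    """
    if files < 2:
        raise ValueError("a ring buffer needs at least 2 files")
    return [
        "dumpcap",
        "-i", str(interface),
        "-b", f"filesize:{filesize_kb}",  # switch to the next file at this size
        "-b", f"files:{files}",           # wrap around after this many files
        "-w", out_file,
    ]
```

On the capture PC you would hand the resulting list to `subprocess.run(...)` or simply type the equivalent command by hand. With the assumed defaults the ring never holds more than roughly 6 x 50 MB on disk.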
If you simply can't do packet captures at all for some reason, then you'll need some sort of logging. Do these remote boxes respond to pings? If they do, you could put a script on the central PC to ping them in rotation and log the results to a file. If you pinged the switch as well you'd have more evidence of where the problem might lie. Alternatively, can the software on the central PC be made to produce some sort of log of the activity?
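The ping-in-rotation idea could look something like the sketch below. It is only a minimal example under a few assumptions: it shells out to the operating system's `ping` tool, the host list and log file name are made up, and the function names are mine. Note that on Windows `ping` can exit with code 0 even when the reply is "Destination host unreachable", so treat the result as a rough indicator.

```python
import subprocess
import sys
import time
from datetime import datetime


def ping_once(host, timeout_ms=1000):
    """Return True if the OS ping tool reports success for one echo request."""
    if sys.platform.startswith("win"):
        cmd = ["ping", "-n", "1", "-w", str(timeout_ms), host]
    else:
        # Linux-style flags; -W takes whole seconds there.
        cmd = ["ping", "-c", "1", "-W", str(max(1, timeout_ms // 1000)), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def log_line(host, ok, when=None):
    """Format one timestamped log entry."""
    when = when or datetime.now()
    return f"{when.isoformat(timespec='seconds')} {host} {'OK' if ok else 'FAIL'}"


def poll(hosts, log_path, interval_s=10, rounds=None, ping=ping_once):
    """Ping every host in turn and append the results to log_path.

    rounds=None keeps going until interrupted; ping can be swapped out for testing.
    """
    done = 0
    while rounds is None or done < rounds:
        with open(log_path, "a") as log:
            for host in hosts:
                log.write(log_line(host, ping(host)) + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

Something like `poll(["192.168.0.1", "192.168.0.10"], "ping.log")` (hypothetical addresses for the switch and one box) would then leave a timestamped record you can line up against the time the transfer stopped.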
Re: Logging for troubleshooting reasons in process network
12 years 11 months ago #11847
You can get switches that will log things, but I'm not sure this will really give you what you require. It might help support the diagnosis, though, so I wouldn't write it off. With many switches, if you set up SNMP on the switch you can monitor a variety of events, including port up/port down. That would tell you if any of your boxes go incommunicado on a physical level. The way to do this would be to set up SNMP on your central PC and install something that can receive traps. There are several freeware trap receiver programs on the net; have a search. Then set the trap destination in the switch to point at the PC and set the trap level to Informational so it reports everything. Another possibility might be to install something on the central PC that queries the switch using SNMP to read the traffic stats for the ports. You could get more detail again using RMON if you wanted. This would show you changes in your traffic pattern that might indicate that a problem occurred, but it wouldn't show you what the problem actually was. To really get down to it, you'll need that packet capture.
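For the idea of polling the port traffic stats, here is a small sketch of what you might do with the numbers once you have them. It assumes you already collect `ifInOctets` readings for a port somehow (the SNMP query itself is not shown), that the counter is a 32-bit one that wraps around, and that "no new octets between two polls" is a good enough sign of a stall. The function names and the sample values are invented for illustration.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to 0 here


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Octets counted between two readings, allowing for one wrap-around."""
    return (current - previous) % modulus


def find_stalls(samples, min_octets=1):
    """Return (start, end) poll intervals in which fewer than min_octets arrived.

    samples is a time-ordered list of (timestamp, counter_value) pairs.
    """
    stalls = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if counter_delta(c0, c1) < min_octets:
            stalls.append((t0, t1))
    return stalls


# Invented readings taken once a minute: traffic stops between t=60 and t=180.
samples = [(0, 1000), (60, 5000), (120, 5000), (180, 5000), (240, 9000)]
print(find_stalls(samples))
```

This only tells you when a port went quiet, not why, which is the same limitation mentioned above.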