modules/wr_endpoint/ep_packet_filter.vhd (branch klyone-180419-finedel_fixes, White Rabbit core collection)

wr_endpoint: fix (one cause of) packet filter misclassification · 4177c4b6

Wesley W. Terpstra authored Jun 18, 2014

The symptom of this bug is that on about 3% of power-ups, a WR endpoint
never reaches the track phase state. This happens because the endpoint
drops the first PTP packet after calibration.

The packet is dropped because it is misclassified. Misclassification is
possible because U_match_buffer and fab_pipe in the RX path can become
desynchronized. When they do, each packet receives the classification of
the previous packet. Since calibration is slow, it is virtually certain that
a BOOTP request is seen in the meantime, so the PTP packet that follows it
inherits the BOOTP classification.
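The off-by-one effect can be illustrated with a toy Python model; it is not the VHDL. A deque of class tags stands in for U_match_buffer and the packet sequence stands in for fab_pipe. The event names, the "spurious" tag, and the push-then-pop ordering per packet are simplifying assumptions made for illustration only.

```python
from collections import deque
from typing import Iterable, List


def classify_rx(events: Iterable[str]) -> List[str]:
    """Toy model of a class-tag FIFO running alongside the packet pipeline.

    Each event is either "glitch" (a tag is pushed with no packet behind it)
    or a packet type such as "bootp" or "ptp". For a packet, the classifier
    pushes one tag and the RX path pops one tag from the head of the FIFO.
    Returns the class actually delivered with each packet, in order.
    """
    tags = deque()
    delivered: List[str] = []
    for ev in events:
        if ev == "glitch":
            tags.append("spurious")  # excess tag with no matching packet
            continue
        tags.append(ev)                   # classifier result for this packet
        delivered.append(tags.popleft())  # RX path consumes the oldest tag
    return delivered
```

Without a glitch, `classify_rx(["bootp", "ptp"])` delivers `["bootp", "ptp"]`. With one spurious tag first, `classify_rx(["glitch", "bootp", "ptp"])` delivers `["spurious", "bootp"]`: the PTP packet carries the BOOTP classification, which is the misclassification described above.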

The U_match_buffer can become desynchronized in multiple ways, but the one
we saw "in the wild" is caused by wrpc-sw lowering PFCR0 while it configures
the packet filter. Due to an unsafe transfer from clk_sys to clk_rx in
ep_packet_filter:p_gen_status, the PFCR0 transition can cause a glitch that
sets done_int high even though no packet is being processed. This pushes an
excess class tag into U_match_buffer, which leaves packets and classes
mismatched. This patch fixes the transfer.
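The commit does not show the fixed VHDL. As a hedged illustration of one common safe pattern for a single-bit level crossing from clk_sys to clk_rx, the Python sketch below models a two-flop synchronizer followed by a falling-edge detector clocked entirely by clk_rx. Treating PFCR0 as a single-bit level, the function name, and the reset-to-first-sample behavior are assumptions; metastability itself cannot be modeled here, so this shows only the cycle-level behavior.

```python
from typing import List, Tuple


def sync_and_detect_fall(samples: List[int]) -> Tuple[List[int], List[int]]:
    """Cycle-based model of a two-flop synchronizer plus falling-edge detect.

    samples[i] is the clk_sys-domain level captured by the first flop at
    clk_rx edge i. Every register below is clocked by clk_rx, and the edge
    detector looks only at the synchronized copy, never at the raw input.
    Returns (synchronized level, one-cycle falling-edge pulse) per edge.
    """
    # Assume all registers have settled to the initial level at reset.
    meta = synced = prev = samples[0] if samples else 0
    synced_out: List[int] = []
    fall_out: List[int] = []
    for s in samples:
        # All registers update on the same clk_rx edge (right side uses old values).
        prev, synced, meta = synced, meta, s
        synced_out.append(synced)
        fall_out.append(int(prev == 1 and synced == 0))
    return synced_out, fall_out
```

For `sync_and_detect_fall([1, 1, 1, 0, 0, 0, 0])` the synchronized level is `[1, 1, 1, 1, 0, 0, 0]` and the pulse is `[0, 0, 0, 0, 1, 0, 0]`: one source transition yields exactly one pulse in the clk_rx domain, because every downstream decision comes from a single registered copy of the signal.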

Unfortunately, even after this patch, in my opinion this code remains
completely unsafe. The core problem is that U_match_buffer and fab_pipe can
become desynchronized at all. This is a very brittle design: one can imagine
many scenarios that lead to this state, after which the WR endpoint never
recovers. A simple example: consider a packet arriving while PFCR0 is being
switched. ep_rx_path:mbuf_we can then pulse twice, once, or never for that
packet, depending on the race between ematch_done and pfilter_done. If it
pulses twice or never, the RX path remains permanently desynchronized.
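Why a single bad write count is permanent can be seen with a small Python model; it is an illustration, not code from the endpoint. The assumption is that the RX path always consumes exactly one class tag per packet, so any packet for which mbuf_we pulses a number of times other than one shifts the alignment between tags and packets. The function name is hypothetical.

```python
from itertools import accumulate
from typing import List


def tag_offsets(writes_per_packet: List[int]) -> List[int]:
    """Track the drift between the class-tag FIFO and the packet stream.

    writes_per_packet[i] is how many times mbuf_we pulsed for packet i
    (ideally exactly 1). Since one tag is consumed per packet, each packet
    changes the offset by (writes - 1): positive means surplus tags are
    queued, negative means tags are missing. Nothing resets the offset.
    """
    return list(accumulate(w - 1 for w in writes_per_packet))
```

`tag_offsets([1, 1, 2, 1, 1])` gives `[0, 0, 1, 1, 1]` and `tag_offsets([1, 0, 1, 1])` gives `[0, -1, -1, -1]`: after one double or missing write, every later packet is misaligned, and in this model only a compensating error in the opposite direction would bring the offset back to zero.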
