Wesley W. Terpstra authored
The symptom of this bug is that about 3% of the time a WR endpoint will power up such that it always fails to reach the track phase state. This is caused by the endpoint dropping the first PTP packet after calibration. The packet is dropped because it is misclassified. This happens because it is possible for the U_match_buffer and fab_pipe in the RX path to become desynchronized. When this happens, packets receive the classification of the previous packet. Since calibration is slow, it is virtually assured that a BOOTP request is seen, leading to the misclassification of the following PTP packet.

The U_match_buffer can become desynchronized in multiple ways, but the one we saw "in the wild" is due to the lowering of PFCR0 in wrpc-sw during packet filter configuration. Due to an unsafe transfer from clk_sys to clk_rx in ep_packet_filter:p_gen_status, it is possible for the transition of PFCR0 to cause a glitch that sets done_int high, even though there is no packet being processed. This puts an excess class tag into U_match_buffer, which leads to the mismatch between packets and classes.

This patch fixes the transfer.

Unfortunately, even after this patch, it is my opinion that this code remains completely unsafe. The core problem is that desynchronization of U_match_buffer and fab_pipe is possible at all. This is a very brittle design. One can imagine many scenarios that lead to this state, after which the WR endpoint will never recover. A simple example: consider a packet arriving while PFCR0 is switched. ep_rx_path:mbuf_we can then pulse twice, once, or never for that packet, depending on the race between ematch_done and pfilter_done. If it pulses anything other than once, the RX path will remain permanently desynchronized.
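
The failure mode described above can be illustrated with a small software model. This is only a sketch, not the VHDL: it assumes the class tags and the packets travel through two independent FIFOs that are paired in arrival order, and that the glitch pushes one stray tag before any packet arrives. The names `classify`, `run_rx_path` and `spurious_tags`, the tag values, and the value of the stray tag are all hypothetical.

```python
from collections import deque


def classify(kind):
    # Hypothetical filter result: in this toy model only PTP is accepted.
    return "accept" if kind == "PTP" else "drop"


def run_rx_path(packets, spurious_tags=0):
    """Pair each packet with the oldest queued class tag.

    The deque stands in for U_match_buffer; iterating over `packets`
    stands in for fab_pipe delivering packets in order.
    """
    match_buffer = deque()

    # Model the glitch: tags written while no packet is being processed.
    # The value of the stray tag ("drop") is an assumption.
    for _ in range(spurious_tags):
        match_buffer.append("drop")

    results = []
    for kind in packets:
        match_buffer.append(classify(kind))   # one write per packet
        tag = match_buffer.popleft()          # one read per packet
        results.append((kind, tag))
    return results


packets = ["BOOTP", "PTP", "PTP"]
print(run_rx_path(packets))                   # tags line up with packets
print(run_rx_path(packets, spurious_tags=1))  # every tag is one packet late
```

With one stray tag the first PTP packet inherits the BOOTP classification and is dropped. Because each packet still writes exactly one tag and reads exactly one tag, the offset never drains, which is why the model never recovers on its own.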
4177c4b6