papers/ICALEPCS2011_PEER_REVIEWED: use common figures

Commit 5bd39806 authored Sep 02, 2014 by Grzegorz Daniluk
14 changed files
--- a/papers/ICALEPCS2011_PEER_REVIEWED/ClockDistribution.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/ClockDistribution.tex
+\section{Clock Distribution}
+
+
+% The resilience of the Clock Distribution translates into continuous and stable 
+% synchronization of all the nodes and switches in the WRN (Table~\ref{tab:requirements}).
+% A loss of time notion in a node can be caused by a link or switch failure - break of clock path 
+% between the TM and the node. In order to prevent synchronization break, redundancy of network 
+% elements (switches, cables) can be introduced ensuring redundant clock paths. However, 
+% the switch-over between redundant elements might introduce instability and render the network 
+% unreliable despite the costly redundancy. Therefore, the seamless switch-over between redundant 
+% clock paths is one of the design-goals to enable network topology redundancy and, as a consequence, 
+% offer robust and stable synchronization. The other reasons for the deterioration of synchronization 
+% accuracy are the variation of external conditions (e.g. temperature) and loss of Ethernet frames with 
+% timing information (PTP).  
+
+%\subsection{Switch-over}
+
+A seamless switch-over between redundant sources of timing (uplink ports) is heavily supported by 
+the Clock Recovery System (CRS) \cite{biblio:TomekMSc} of the switch and the WR extension to PTP 
+(WRPTP)~\cite{biblio:WRPTP}. 
+
+Figure~\ref{fig:switch-over} presents an example where a switch (timing slave) is connected 
+(by its uplinks 1 \& 2) 
+to two other switches (primary and secondary masters) -- the sources of timing. On both 
+uplinks the frequency is recovered from the signal and provided to the CRS. Similarly, WRPTP 
+measures delay and offset on each of the links and provides this data to the CRS. 
+The modified Best Master Clock (mBMC) algorithm \cite{biblio:WRPTP} decides which of the 
+timing masters is ``better'' and elects it the primary; the other is considered secondary (backup).
+The information from {\it uplink 1} (primary) is used to control 
+the CRS and adjust the local time. However, at any time all the necessary information from the 
+{\it uplink 2} is available and a seamless switch-over can be performed in case of 
+primary master failure \cite{biblio:TomekMSc}.
+
+\begin{figure}[t]
+\centering
+\includegraphics[width=3.2in]{../../figures/robustness/clockDistribution.eps}
+\caption{Seamless switch-over.}
+\label{fig:switch-over}
+\end{figure}
+
+%\subsection{Variable conditions and loss of PTP messages}
+In addition to switch-over-related synchronization instability, variations in external temperature 
+can degrade accuracy. This problem, however, is solved by the PTP standard itself: 
+frequent link delay measurements compensate for the fluctuation. 
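+
+As a sketch of how repeated measurements track a changing link delay, consider the standard PTP 
+delay request-response exchange. The timestamp symbols below are our notation, and a symmetric 
+link is assumed; any additional asymmetry handling in WRPTP is not shown. With $t_1$ and $t_4$ 
+taken on the master and $t_2$ and $t_3$ on the slave, the one-way link delay and the offset of 
+the slave clock from the master clock are estimated as
+\begin{equation}
+\mathrm{delay} = \frac{(t_2 - t_1) + (t_4 - t_3)}{2}, \qquad
+\mathrm{offset} = \frac{(t_2 - t_1) - (t_4 - t_3)}{2}.
+\end{equation}
+Each new exchange refreshes both estimates, so a slow temperature-induced change of the 
+link delay is followed rather than accumulated as a synchronization error.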
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/ControlDataDistribution.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/ControlDataDistribution.tex
+\section{Data Resilience}
+
+
+\subsection{Forward Error Correction}
+\label{sec:fec}
+
+The objective of the FEC scheme is to decrease the loss rate of the CMs, preferably 
+to less than one per year. WR uses optical fiber and CAT-5 cable as its physical media. The number 
+of corrupted bits received relative to the total number of bits received is called the Bit Error Rate 
+(BER). The BER characterizes a physical medium and can be used to characterize the entire 
+switched network.
+%  if the following factors are taken into account: 
+% (1) {\it type of cabling} (fiber/twisted pair),
+% (2) {\it logic topology},
+% (3) {\it network address} (broadcast/unicast).
+A WRN can be seen as a Packet Erasure Channel (PEC) or as a Binary Erasure Channel (BEC) depending 
+on the effect of a bit error on the frame. If the frame is lost (e.g. dropped by the switch due to 
+a corrupted header or lost during switch-over between redundant components), the WRN is a PEC. 
+If the bit error happens in the link between a switch and node, a corrupted frame 
+\modified{can be used (optional)}
+%is used 
+to attempt frame recovery. In such a case, the channel is a BEC. Each type of channel requires 
+a different FEC solution. Therefore, two concatenated FECs are used in WR. 
+Reed-Solomon (R-S) %\cite{biblio:r-s} 
+%\cite{biblio:coding} 
+coding is used for the PEC and 
+allows $k$ original frames to be encoded into $n$ encoded frames ($n>k$). 
+Reception of any $k$ encoded frames suffices to decode the original frames. 
+\modified{Hamming coding with additional parity (SEC-DED)} 
+%Hamming coding 
+is used for the BEC and allows up to two simultaneous bit errors to be detected and 
+a single error to be corrected.
+These two schemes (R-S and Hamming) are combined to encode each CM: it is 
+split into two and encoded using R-S into four messages (two original and two 
+with redundant data). Each of the four messages is then encoded using Hamming. The encoded 
+messages are sent in a burst of four Ethernet frames. Reception of any two of these frames is 
+sufficient to decode the original Control Message. 
+A systematic analysis, using the BER characteristic of the WRN, shows that the presented FEC scheme 
+guarantees fewer than one CM lost per year due to physical medium 
+imperfection, as can be seen from Table~\ref{tab:gsi_cern_fec}. 
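+
+To illustrate why two-out-of-four reception is robust, the following sketch assumes that 
+frames are lost independently with probability $p$. Both the independence assumption and the 
+symbol $p$ are ours, introduced only for illustration. The CM is then lost only if three or 
+four of the four frames are lost:
+\begin{equation}
+P_{\mathrm{loss}} = {4 \choose 3}\, p^{3}(1-p) + p^{4} = 4p^{3} - 3p^{4} \approx 4p^{3} \quad (p \ll 1).
+\end{equation}
+For example, $p = 10^{-5}$ would give $P_{\mathrm{loss}} \approx 4 \times 10^{-15}$. The values in 
+Table~\ref{tab:gsi_cern_fec} come from the full analysis and are not reproduced by this simplification.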
+
+\begin{table}[ht]
+ 	\begin{center}
+\caption{GSI and CERN FEC characteristics.}
+\begin{tabular}{|p{4cm}|c|c|} \hline
+%	\cline{2-3}
+%	&  \multicolumn{2}{|c|}{Use Case} \\ \cline{2-3}
+\rowcolor{gray!35}{}
+{\bf Parameter}	&  {\bf GSI} & {\bf CERN} \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Control Message length} & 500 bytes & 1500 bytes     \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Control Messages per year} & $3.145 \times 10^{11}$ & $3.145 \times 10^{8}$ \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Max Bit Correct.} & 1 & 1  \\ \hline
+%	\multicolumn{1}{|p{4cm}|}{Parity-Check Bits} & 13    &  13   \\ \hline
+%	\multicolumn{1}{|p{4cm}|}{PEC Code Overhead} & 3  & 2 \\ \hline
+%	\multicolumn{1}{|p{4cm}|}{Payload Length} & 400 b  & 800b \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Payload Length} & \modified{294 bytes}  & \modified{854 bytes} \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Num Encoded Frames} & 4  & 4 \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Frames Needed at Receiver} & 2 & 2 \\ \hline
+	\multicolumn{1}{|p{4cm}|}{Probability of Losing a CM} & $10^{-14}$ & $10^{-13}$\\ \hline
+	\end{tabular}   
+	
+	\label{tab:gsi_cern_fec}
+	\end{center}
+\end{table}
+
+\subsection{Rapid Spanning Tree Protocol (RSTP)}
+
+In an Ethernet network with redundant topology, the problem of loops (causing ``broadcast storms'') 
+is handled by the Rapid Spanning Tree Protocol (RSTP)
+% \cite{biblio:IEEE8021D}
+. It creates a loop-free 
+logical topology by blocking appropriate ports in switches, and unblocks them in case of topology 
+break (due to element failure).
+
+The functionality provided by the RSTP is essential for the WRN. However, the convergence speed 
+provided by the standard implementation of the RSTP (milliseconds 
+%\cite{biblio:RSTPperf} 
+at best) 
+would cause many CMs to be lost during the process. This is not acceptable: we need 
+a solution that is fast enough to prevent losing any CMs. Since we know the 
+size-range of the CMs (Table~\ref{tab:requirements}) and how they are FEC-encoded into Ethernet frames,
+we can calculate the maximum value of the convergence time: 3$\mu s$. This time is smaller than 
+the duration of transmitting a single frame with FEC-encoded CM -- this ensures that no more than 
+two frames with FEC-encoded CM are lost, thus the CM can be recovered.
+
+In order to achieve a convergence time of 3$\mu s$, the switch-over between active 
+and backup connections needs to be performed in the hardware as soon as the link-down is detected. 
+It can only be done if the alternative topology is known in advance. The knowledge of alternative 
+topology is translated into an RSTP-assignment of alternative and backup roles of switch ports, 
+i.e., at least one port with an alternative role must be identified in every switch 
+(except the topology-root switch).    
+%\modified{, i.e at least one port with each of these roles must be identified in every switch}.
+%
+%If at least one port of a switch is assigned an alternative role, it means that 
+%the RSTP algorithm establishes more than one path to the topology-root switch and therefore 
+%the alternative topology is know in advance. 
+%Such ports are identified when the RSTP algorithm establishes more than one path to the 
+%topology-root switch and all paths can be used simultaneously, 
+%
+If we ensure, by restricting the topology, that RSTP identifies the alternative links, 
+we can use its data to feed the hardware, consequently achieving the required convergence time 
+and staying standard-compatible: 
+the hardware switch-over is just a faster RSTP-driven convergence. The required topology 
+restrictions, described in \cite{biblio:robustness}, largely overlap with those imposed by 
+the Time Distribution. 
+
+ 
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/Determinism.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/Determinism.tex
+\section{Determinism}
+
+% The delivery latency of an Ethernet frame varies with cable length and the number of hops (switches) 
+% it has to traverse to reach its destination, the traffic load on the way and 
+% the assigned Class of Service (CoS, \cite{bilbio:vlan}). 
+A carefully configured and properly used WRN offers deterministic Ethernet frame delivery 
+thanks to the implementation of CoS and the fact that the delay introduced by the switch can be 
+verified by analysis of {\bf publicly available source code} \cite{biblio:whiteRabbit}. 
+Such analyses were performed to verify the worst-case upper bound of the 
+delivery latency of a CM against the requirements listed in Table~\ref{tab:requirements}. 
+The results, presented in Table~\ref{tab:CMlatency} ({\it Store-and-forward} column), 
+take into account that a CM is encoded into four Ethernet frames (as required by the FEC 
+and described in the next Section), is sent with the highest priority (CoS), and always 
+traverses three hops.
+
+\begin{table}[ht]
+\caption{Control Message (CM) delivery latency estimates.} 
+\centering
+
+	\begin{tabular}{| c | c | c | c | c | c |}          \hline
+\rowcolor{gray!35}{}
+               & \multicolumn{4}{|>{\columncolor{gray!35}}c|}{\textbf{CM delivery latency}}                    \\ \cline{2-5}
+\rowcolor{gray!35}{}
+\textbf{CM size}& \multicolumn{2}{|>{\columncolor{gray!35}}c|}{\textbf{Store-and-forward}} 
+                &\multicolumn{2}{|>{\columncolor{gray!35}}c|}{\textbf{Cut-through}}                           \\\cline{2-5}
+\multicolumn{1}{|>{\columncolor{gray!35}}c|}{} &
+\multicolumn{1}{|>{\columncolor{gray!35}}c|}{GSI} &
+\multicolumn{1}{|>{\columncolor{gray!35}}c|}{CERN} &
+\multicolumn{1}{|>{\columncolor{gray!35}}c|}{GSI} &
+\multicolumn{1}{|>{\columncolor{gray!35}}c|}{CERN} \\ \hline
+%		&    GSI           & CERN          &    GSI           & CERN          \\ \hline
+%200 bytes      & ???$\mu s$       & ???$\mu s$    & ??$\mu s$        & ???$\mu s$    \\ \hline
+500 bytes      & 221$\mu s$       & 283$\mu s$    & 76$\mu s$        & 118$\mu s$    \\ \hline
+1500 bytes     & 285$\mu s$       & 325$\mu s$    & 102$\mu s$       & 142$\mu s$    \\ \hline
+5000 bytes     & 324$\mu s$       & 364$\mu s$    & 162$\mu s$       & 202$\mu s$    \\ \hline
+\end{tabular}
+\label{tab:CMlatency}
+\end{table}
+
+The analysis revealed that GSI's requirements are not fulfilled: the upper-bound delivery latency
+for the required CM size and a maximum distance of 2km is greater than 100$\mu s$. 
+
+The proposed solution to decrease the delivery latency targets the CD only and 
+takes advantage of its characteristics (broadcast within a VLAN, sent by a privileged node). 
+We propose to split the highest priority of 
+the CoS into two (unicast and broadcast) and use the highest-priority broadcast Ethernet traffic only for 
+the CD. Moreover, this particular traffic shall be forwarded using the cut-through method 
+(unlike the store-and-forward method normally used in the switch), which can be particularly effective 
+for broadcast traffic with a single source (DM). The results, 
+presented in Table~\ref{tab:CMlatency} ({\it Cut-through} column), show a significant improvement: 
+for GSI's 500-byte CM, the bound drops from 221$\mu s$ to 76$\mu s$, below the required 100$\mu s$. 
+The solution requires hardware-supported cut-through forwarding in the switch as described 
+in \cite{biblio:robustness}. 
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/FailureStudy.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/FailureStudy.tex
+\section{Failure Study}
+
+One of the main possible reasons for WRN failure, which affects both Timing and Data Distribution, is 
+a malfunction of its elements (switches or links). Since the distribution of information 
+in the WRN is of one-to-all character (Data/Timing Master to all nodes), all the elements of the WRN are 
+considered Single Points of Failure (SPoF)~\cite{biblio:mtbf}. Malfunction of any SPoF 
+results in failure of the entire system.
+SPoFs can be eliminated by introducing redundancy of the system components. Due to its special features 
+(distribution of frequency over physical layer) and strict requirements (determinism, low data loss), 
+the number of possible redundant topologies of the WRN is restricted, as explained in the 
+following sections. 
+
+Imperfections of the physical medium as well as switching between redundant elements of the network 
+(which takes time) can cause loss or corruption of data. The deterministic and \modified{mostly} broadcast character 
+of the data distribution in the WRN calls for the application of Forward Error Correction (FEC) 
+%\cite{biblio:coding} 
+-- adding redundant information on transmission to enable recovery of lost or corrupted data 
+on reception. This brings a constant data overhead and a residual probability that the added redundancy is 
+not sufficient to recover the data. However, it is the price to pay for ensuring low latency 
+and determinism of data delivery in the WRN. 
+
+The delivery latency of an Ethernet frame varies with cable length and the number of hops (switches) 
+it has to traverse to reach its destination, the traffic load on the way and 
+the assigned Class of Service (CoS). Therefore, to ensure the required determinism 
+of the CD delivery, we need to make sure that there is no congestion of Ethernet frames 
+carrying CMs. Moreover, the number of hops (the latency introduced by them) needs to be 
+sufficiently small, which can be done by restricting the topology. 
+
+The resilience of the Clock Distribution translates into continuous and stable 
+synchronization of all the nodes and switches in the WRN (Table~\ref{tab:requirements}). Although 
+network redundancy eliminates SPoFs, the switch-over between redundant elements might introduce 
+instability and render the network unreliable despite the costly redundancy. 
+Therefore, a seamless switch-over between redundant clock paths needs to be ensured. 
+Another reason for the deterioration of the synchronization 
+accuracy is the variation of external conditions (e.g. temperature) which needs to be compensated.
+
+% In terms of the Data Distribution reliability, the topology redundancy can turn out to be 
+% useless, if the switch-over between redundant elements causes more data to be lost then the 
+% capabilities of FEC scheme.
+% {\it [add here, change the rest]}
+% In summary, we need investigate how to :
+% \begin{Itemize}
+%   \item  eliminate/decrease data loss due to :
+%     \begin{Itemize}
+%       \item physical medium imperfection,
+%       \item switch over between redundant elements,
+%       \item traffic congestion,
+%     \end{Itemize}
+%   \item eliminate synchronization instability due to:
+%     \begin{Itemize}
+%       \item switch over between redundant data paths,
+%       \item external condition variations,
+%       \item Ethernet frame loss (PTP),
+%     \end{Itemize}
+%   \item ensure required upper-bound delivery latency of Control Data.
+% \end{Itemize}
--- a/papers/ICALEPCS2011_PEER_REVIEWED/Makefile
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/Makefile
+all : WhiteRabbit.pdf
+
+.PHONY : all clean
+
+WhiteRabbit.pdf : WhiteRabbit.tex 
+	latex $^
+	bibtex WhiteRabbit
+	latex $^
+	latex $^
+	dvips -j0 WhiteRabbit
+	ps2pdf  -dPDFX -dEmbedAllFonts=true -dSubsetFonts=true -dEPSCrop=true WhiteRabbit.ps
+
+clean :
+	rm -f *.eps *.pdf *.dat *.log *.out *.aux *.dvi *.ps *~ *.bbl *.blg
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/OverallReliability.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/OverallReliability.tex
+\section{Overall Reliability}
+
+The final equation of the WRN reliability is a sum of the data and clock distribution reliabilities.
+The clock distribution is assumed to be sufficiently accurate as long as there is a connection 
+between the TM and all the nodes. The same applies to the CD distribution: 
+as long as there is a valid connection, the FEC ensures that the data is delivered with 
+sufficient reliability, and the latency calculations show it to be deterministic, while 
+congestion is prevented by CoS and the limited number of data sources (DM). Consequently, the overall 
+reliability is strongly dependent on the WRN topology, which needs to be appropriate for the proposed 
+solutions (SyncE, H/W-supported RSTP, upper-bound latency). 
+
+For the comparison of different network topologies, we consider the reliability of a network of 
+switches. 
+%with M inputs (connected to DM/TM). 
+Each node is connected to such a network with M links 
+(each to a separate switch). The value of M reflects the level of redundancy 
+(M=1 for no redundancy, M=2 for double redundancy, etc).
+
+
+In the calculations of the network reliability we used the concept of Mean Time Between Failure (MTBF) 
+and its relation to the failure probability presented in \cite{biblio:mtbf} 
+(a very simplified mathematical model). In order to calculate the MTBF of the entire network, we need the 
+MTBF of each network component: switches and links. Since the WR switches are still under 
+development (no MTBF measured), we used representative values for CISCO switches 
+($\{2, 10, 100\} \times 10^4$[h]). Two estimation methods were used: ``Fault Tree analysis'' 
+\cite{biblio:faultTree} and analytic. Both provide only rough estimates of the reliability. 
+The former allowed us to estimate the two-terminal reliability (DM to a single node) 
+%\cite{biblio:INF_TECH} 
+of simple non/double/triple-redundancy topologies ($P_f$). The most desired value is the 
+all-terminal network reliability ($P_{f\_Network}$), where $P_f < P_{f\_Network} < N_{nodes} \times P_f$. 
+Table~\ref{tab:2000nodesReliability} 
+presents rough estimations of $P_{f\_Network}$ using analytic calculations for the three considered 
+topologies ($MTBF_{Switch}$=200 000[h]). However, to meet the requirement of $\approx$2000 nodes and 
+only three network layers (hops), 
+\modified{the Data Master node is connected to more separate switches than 
+the level of redundancy (M).}
+% the topologies are of the type M-inputs/N-outputs, where 
+% $N \geq M$.
+The estimates show that a triple-redundancy topology can barely satisfy CERN's requirements 
+(Table~\ref{tab:requirements}).
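+
+For orientation, a commonly used simplified model of this kind is sketched here. It is stated as an 
+assumption; the exact relation used in the calculations is the one given in \cite{biblio:mtbf}. 
+Taking a constant failure rate, the probability that a component fails within a mission time $t$ 
+(our symbol) is
+\begin{equation}
+P_f(t) = 1 - e^{-t/MTBF} \approx \frac{t}{MTBF} \quad (t \ll MTBF).
+\end{equation}
+Under the further assumption of independent failures, $M$ redundant paths with individual failure 
+probability $P_{f,1}$ (again our notation) fail together with probability $P_{f,1}^{M}$, which is 
+consistent with each added level of redundancy lowering $P_f$ by several orders of magnitude in 
+Table~\ref{tab:2000nodesReliability}.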
+
+% \begin{figure}[t]
+% 	 \centering
+% 	\includegraphics[width=3.4in]{fig/threeTopologies.ps}
+% 	\caption{Examples of topologies with different level of redundancy.}
+% 	\label{fig:threeTopology}
+% \end{figure}
+
+\begin{table}[ht]
+
+%\caption{Different topologies ($\approx 2000$ nodes).}
+\caption{Reliability of WRN topologies.}
+\centering
+%\rowcolors {0}{gray!35}{}
+
+\begin{tabular}{| c | c | c | c |}        \hline
+%{\bf Redundancy}& \textbf{Switches}  & \multicolumn{2}{| c |}{\textbf{$MTBF_{Switch}$=  20 000[h] }} \\
+%                &                    & $P_f$                       & MTBF[h]               \\ \hline
+\rowcolor{gray!35}{}
+{\bf Redundancy}& \textbf{Switches}  & $P_f$                       & MTBF[h]               \\ \hline
+No              &  127               & $ 2.08*10^{-3}$             & $ 5.77*10^{3}$        \\ \hline
+Double          &  292               & $ 4.71*10^{-7}$             &  $ 2.55*10^{7}$       \\ \hline
+Triple          &  495               & $ 3.06*10^{-11}$            &  $ 4.08*10^{11}$      \\ \hline
+\end{tabular}
+\label{tab:2000nodesReliability}
+\end{table}
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/ReliabilityDefinition.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/ReliabilityDefinition.tex
+\section{Definition of reliability in a WRN}
+
+A WRN, consisting of White Rabbit Switches (switches) connected by fiber 
+or copper, is meant to transport information among White Rabbit Nodes (nodes). We distinguish 
+two types of information distributed over the WRN: 
+%(1) {\bf Timing} (frequency and International Atomic Time) and 
+(1) {\it Timing} (frequency and Coordinated Universal Time) and 
+(2) {\it Data} (the Ethernet traffic).
+This translates into two types of services provided by the WRN which have their own requirements and
+can be handled separately. The requirements are defined by GSI and CERN as the prospective 
+users of WR to control their accelerators.
+
+
+\subsection{Timing Distribution}
+
+Timing is distributed in the WRN from a switch/node called Timing Master (TM) 
+to all the other nodes/switches in the network. 
+% The TM is usually connected 
+% to the external source, such as Global Positioning System (GPS) receiver. 
+All the devices in the 
+WRN lock their frequency (syntonize) and adjust their local clocks (synchronize) to that of the TM. 
+The deviation between the clock of the TM and that of any other node/switch is called {\bf accuracy}. 
+A stable and continuous synchronization of all the nodes with an appropriate accuracy is the key 
+requirement of the Timing Distribution in the WRN.
+
+\subsection{Data Distribution}
+
+The critical data distributed over the WRN is the one carrying sets of commands (events) which are 
+organized into Control Messages (CM). The CMs are sent by a privileged node (Data Master, DM) in the 
+payload of the Ethernet frame(s). Therefore, the Data Distribution in the WRN is broken into 
+(1) {\it Control Data (CD)}  -- the Ethernet frames carrying CMs, critical, and 
+(2) {\it Standard Data (SD)} -- the Ethernet frames which do not carry CMs, non-critical.
+The reliability of the WRN depends on the successful delivery of the CD to all 
+the designated nodes. The CMs are always broadcast within a VLAN
+% \cite{bilbio:vlan}
+, which can span 
+the entire network. The worst-case upper bound of their delivery latency from the DM to any node in 
+the network, regardless of its location ({\bf maximum distance from the DM}), is required to be 
+guaranteed by the network -- this is the {\bf determinism} requirement. 
+
+\subsection{Reliability of the WRN}
+
+The reliability of the WRN relies on the {\bf deterministic} delivery of the CD 
+to all the designated nodes and their sufficiently {\bf accurate and stable synchronization}.  
+This means that the WRN is considered non-functional if one or more of the following occur:
+\begin{itemize}
+  \item A node is synchronized with insufficient accuracy.
+  \item A designated node receives corrupted CD or no CD.
+  \item The upper-bound delivery latency has been exceeded.
+\end{itemize}
+% (1) A node is synchronized with insufficient accuracy;
+% (2) A designated node receives corrupted CD or no CD;
+% (3) The upper-bound delivery latency has been exceeded.
+Unreliability is translated into the number of CMs considered lost (not delivered, delivered 
+corrupted or in a non-deterministic way) in a given period of time. During this time,  
+the synchronization must always be of the required quality. 
+Quantitative requirements of the accelerator facilities are listed in Table~\ref{tab:requirements}.
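+As a plausibility check of these figures (our arithmetic, reading the CM failure rate as the 
+fraction of CMs that may be lost), one lost CM per year at GSI's failure rate corresponds to
+\begin{equation}
+N_{CM} = \frac{1}{3.17 \times 10^{-12}} \approx 3.15 \times 10^{11}
+\end{equation}
+CMs sent per year, i.e. roughly $10^{4}$ CMs per second over the $\approx 3.15 \times 10^{7}$ seconds 
+of a year. The symbol $N_{CM}$ is introduced here for illustration only.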
+
+\begin{table}[ht]
+\caption{GSI's and CERN's requirements summary.} 
+\centering
+	\begin{tabular}{| l | c | c |}                        \hline
+%\textbf{Requirement}& \multicolumn{2}{|c|}{\textbf{Value(s)}}     \\
+\rowcolor{gray!35}{}
+\textbf{Requirement}     & {\bf GSI}        & {\bf CERN}          \\ \hline
+Max latency    		 & 100$\mu s$       & 1000$\mu s$         \\ \hline
+CM failure rate          & $3.17*10^{-12}$  & $3.17*10^{-11}$     \\ \hline
+CMs lost per year        & 1                & 1                   \\ \hline
+$d_{max}$ from DM        & 2km              & 10km                \\ \hline
+CM size 		 & 200-500 bytes    & 1200-5000 bytes     \\ \hline
+Accuracy	  	 & probably 8ns     & 1$\mu s$ to  ~2ns   \\
+%accuracy                 &                  & few nodes  ~2ns     \\
+\hline
+
+\end{tabular}
+\label{tab:requirements}
+\end{table}
--- a/papers/ICALEPCS2011_PEER_REVIEWED/WhiteRabbit.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/WhiteRabbit.tex
+
+
+%\documentclass{JAC2003}  % A4
+%\documentclass[acus]{JAC2003} % US
+\documentclass[reprint, superscriptaddress,aps,prstab]{revtex4-1}
+
+\usepackage{graphicx}
+\usepackage{booktabs}
+\usepackage{color}
+\usepackage{multirow}
+%\usepackage{multicol}
+\usepackage[table]{xcolor}
+\usepackage{colortbl}
+\usepackage{array}
+
+%\setlength{\titleblockheight}{27mm}
+
+\hyphenation{op-tical net-works semi-conduc-tor}
+
+
+
+%\newcommand \modified[1]{{\textcolor{red}{#1}}}
+\newcommand \modified[1]{{\textcolor{black}{#1}}}
+
+\begin{document}
+
+\title{RELIABILITY IN A WHITE RABBIT NETWORK}
+
+\input{authors}
+
+\input{abstract}
+
+\maketitle
+
+
+\input{introduction}
+
+\input{ReliabilityDefinition}
+
+\input{FailureStudy}
+
+\input{Determinism}
+
+\input{ControlDataDistribution}
+
+\input{ClockDistribution}
+
+
+
+\input{OverallReliability}
+
+\input{conclusion}
+
+
+
+\bibliographystyle{IEEEtran}
+\bibliography{IEEEabrv,./biblio}
+
+\end{document}
+
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/WhiteRabbitNotes.bib
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/WhiteRabbitNotes.bib
--- a/papers/ICALEPCS2011_PEER_REVIEWED/abstract.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/abstract.tex
+\begin{abstract}
+
+White Rabbit (WR) is a time-deterministic, low-latency Ethernet-based network which enables 
+transparent, sub-ns accuracy timing distribution. It is being developed to replace 
+the General Machine Timing (GMT) 
+%\cite{biblio:GMT} 
+system currently used at CERN and will become 
+the foundation for the control system of the Facility for Antiproton and Ion Research (FAIR) 
+at GSI. High reliability is an important issue in WR's design, 
+since unavailability of the accelerator's 
+control system will directly translate into expensive downtime of the machine. 
+A typical WR network is required to lose no more than a single message per year. 
+Due to WR's complexity, the translation of this real-world requirement into 
+a reliability requirement constitutes an interesting issue on its own -- a WR network 
+is considered functional only if it provides all its services to all its clients at any time. 
+This paper defines reliability in WR and describes how it was addressed by dividing it into 
+sub-domains: deterministic packet delivery, data 
+%redundancy, 
+resilience, 
+topology redundancy and clock 
+resilience. The studies show that the Mean Time Between Failure (MTBF) of the WR Network 
+is the main factor affecting its reliability. Therefore, probability calculations for 
+different topologies were performed using ``Fault Tree analysis'' and analytic estimations. 
+Results of the study show that the requirements of WR are demanding. Design changes might be needed 
+and further in-depth studies required, e.g. Monte Carlo simulations. Therefore, a direction 
+for further investigations is proposed.
+\end{abstract}
--- a/papers/ICALEPCS2011_PEER_REVIEWED/authors.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/authors.tex
+
+%\author
+%{
+%      Maciej Lipi\'{n}ski, Javier Serrano, Tomasz W\l{}ostowski, CERN, Geneva, Switzerland\\
+%      Cesar Prados, GSI, Darmstadt, Germany       
+%}
+
+
+\author{C.Prados}
+\affiliation{GSI Helmholtz Centre for Heavy Ion Research, Darmstadt, Germany}
+\author{Maciej Lipi\'{n}ski}
+\affiliation{CERN, Geneva, Switzerland}
+\author{J.Serrano}
+\affiliation{CERN, Geneva, Switzerland}
+\author{T. Wlostowski}
+\affiliation{CERN, Geneva, Switzerland}
+
--- a/papers/ICALEPCS2011_PEER_REVIEWED/biblio.bib
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/biblio.bib
+@standard{biblio:IEEE8021D,
+  title		= "IEEE Standard for Local and metropolitan area networks 
+                   Media Access Control (MAC) Bridges",
+  organization	= "IEEE",
+  address	= "New York",
+  number	= "802.1D",
+  year		= "2004",
+}
+
+
+@standard{biblio:IEEE1588,
+  title		= "IEEE Standard for a Precision
+		   Clock Synchronization Protocol for Networked Measurement and Control Systems",
+  organization	= "IEEE",
+  address	= "New York",
+  number	= "1588-2008",
+  year		= "2008",
+}
+
+@standard{biblio:IEEE8023,
+  title		=  "IEEE Standard for
+		   Information Technology--Telecommunications and Information Exchange Between
+		   Systems--Local and Metropolitan Area Networks--Specific Requirements Part 3:
+		   Carrier Sense Multiple Access With  Collision Detection (CSMA/CD) Access Method
+		   and Physical Layer Specifications - Section Three",
+  year		= "2008",
+  organization	= "IEEE",
+  address	= "New York",
+  number	= "802.3-2008",
+}
+
+@standard{bilbio:vlan,
+  title         = "{IEEE Standard for Local and metropolitan area networks 
+                     Virtual Bridged Local Area Networks}", 
+  year          = "2005", 
+  organization  = "IEEE",
+  address       = "New York",
+  number        = "802.1Q-2005"
+}
+
+
+@standard{biblio:SynchE,
+  title		= "Timing characteristics of a synchronous Ethernet equipment slave clock {(EEC)}",
+  year		= "2007",
+  number	= "G.8262",
+  organization	= "ITU-T",
+}
+
+@inproceedings{biblio:ISPCS2011,
+  author        = "M. Lipinski and T. Wlostowski and J. Serrano and P. Alvarez",
+  title         = "White Rabbit: a {PTP} Application for robust sub-nanosecond synchronization",
+  booktitle     = "Proceedings of ISPCS2011",
+  address       = "Munich, Germany",
+  year          = "2011",
+}
+
+@inproceedings{biblio:GMT,
+  author        = "J. Serrano and P. Alvarez and D. Dominguez and J. Lewis",
+  title         = "Nanosecond Level {UTC} Timing Generation and Stamping in {CERN}'s {LHC}",
+  booktitle     = "Proceedings of ICALEPCS2003",
+  address       = "Gyeongju, Korea",
+  year          = "2003",
+}
+
+@techreport{biblio:FAIRtimingSystem,
+  author        = "T. Fleck and C. Prados and S. Rauch and M. Kreider",
+  title         = "{FAIR} Timing System",
+  institution   = "GSI",
+  address       = "Darmstadt, Germany",
+  year          = "2009",
+  note          = "v1.2",
+}
+
+@inproceedings{biblio:distOscilloscope,
+  author        = "S. Deghaye and D. Jacquet and I. Kozsar and J. Serrano",
+  title         = "{OASIS}: A NEW SYSTEM TO ACQUIRE AND DISPLAY THE ANALOG SIGNALS FOR {LHC}",
+  booktitle     = "Proceedings of ICALEPCS2003",
+  address 	= "Gyeongju, Korea",
+  year          = "2003",
+}
+
+@inproceedings{biblio:PAC11,
+  author        = "J. Serrano and P. Alvarez and M. Lipinski and T. Wlostowski",
+  title         = "Accelerator Timing Systems Overview",
+  booktitle     = "Proceedings of PAC11",
+  address 	= "New York, USA",
+  year          = "2011",
+}
+
+@Inproceedings{biblio:WRproject,
+  author        = "J. Serrano and P. Alvarez and M. Cattin and E. G. Cota and others",
+  title         = "{The White Rabbit Project}",
+  booktitle     = "ICALEPCS",
+  address 	= "Kobe, Japan",
+  year          = "2009",
+}
+
+@Misc{biblio:WRPTP,
+  author 	= "E.G. Cota and M. Lipinski and T. Wlostowski and E.V.D. Bij and J. Serrano",
+  title 	= "{White Rabbit Specification: Draft for Comments}",
+  note          = "v2.0",
+  month		= "july",
+  year 		= "2011",
+  howpublished	= {\url{http://www.ohwr.org/documents/21}}
+}
+
+@Misc{biblio:CERNwrControlAndTiming,
+  author 	= "J-C.Bau and M.Lipinski",
+  title 	= "{White Rabbit CERN Control and Timing Network}",
+  month		= "July",
+  year 		= "2011",
+  howpublished	= {\url{http://www.ohwr.org/documents/85}}
+}
+
+@Misc{biblio:robustness,
+  author 	= "C.Prados and M.Lipinski",
+  title 	= "{White Rabbit and Robustness}",
+  month		= "March",
+  year 		= "2011",
+  howpublished	= {\url{http://www.ohwr.org/documents/103}}
+}
+
+@mastersthesis{biblio:TomekMSc,
+  author 	= "T.Wlostowski",
+  title 	= "Precise time and frequency transfer in a {White} {Rabbit} network",
+  month		= "may",
+  year 		= "2011",
+  school 	= "Warsaw University of Technology",
+  howpublished	= {\url{http://www.ohwr.org/documents/80}}
+}
+
+@Inproceedings{biblio:Takahide,
+  author 	= "Takahide Murakami and Yukio Horiuchi",
+  title 	= "{A Master Redundancy Technique in IEEE 1588 Synchronization with a Link Congestion
+		   Estimation}",
+  booktitle     = "Proceedings of ISPCS",
+  year 		= "2010",
+}
+
+@electronic{biblio:whiteRabbit,
+  title 	= "{White Rabbit}",
+  howpublished	= {\url{http://www.ohwr.org/projects/white-rabbit}}
+}
+
+@article{biblio:ohl,
+  author        = "M.Giampietro",
+  title         = "Hardware joins the open movement",
+  journal       = "CERN Courier",
+  address 	= "CERN, Geneva",
+  year          = "2011",
+  howpublished	= {\url{http://cerncourier.com/cws/article/cern/46054}},
+}
+@article{biblio:RSTPperf,
+  author        = "R. Pallos and J. Farkas and I. Moldovan and C. Lukovszki",
+  title         = "Performance of Rapid Spanning Tree Protocol in Access and Metro Networks",
+  journal       = "2nd International ICST Conference on Access Networks",
+  year          = "2007",
+}
+
+@article{biblio:r-s,
+  author        = "I. S. Reed and G. Solomon",
+  title         = "{Polynomial Codes Over Certain Finite Fields}",
+  journal       = "SIAM Journal of Applied Math",
+  address 	= "USA",
+  year          = "1960",
+}
+
+@book{biblio:mtbf,
+  author        = "K.Dooley",
+  title         = "Designing Large-Scale LANs",
+  publisher     = "O'Reilly",
+  year          = "2002",
+}
+
+@book{biblio:coding,
+  author        = "S. Lin and D. J. Costello",
+  title         = "Error Control Coding",
+  publisher     = "Pearson Prentice Hall",
+  year          = "2004",
+}
+
+@book{biblio:INF_TECH,
+  author        = "D.J.C. MacKay",
+  title         = "Information Theory, Inference, and Learning Algorithms",
+  publisher     = "Cambridge University Press",
+  year          = "2005",
+}
+
+@misc{biblio:faultTree,
+  title         = "Reliability Workbench, Fault Tree",
+  publisher     = "Isograph",
+  howpublished	= {\url{www.isograph.com}},
+}
\ No newline at end of file
--- a/papers/ICALEPCS2011_PEER_REVIEWED/conclusion.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/conclusion.tex
+\section{Conclusions}
+
+
+A WRN should be regarded as an ordinary Ethernet network with optional built-in features 
+which, when properly used, make it more robust and reliable. This, however, comes at the price 
+of topology restrictions and the cost of redundant elements. The reliability study described in this 
+article and detailed in \cite{biblio:robustness} identifies the areas which need to be addressed to 
+increase the reliability of a WRN. The development of WR is an ongoing effort: some of the 
+suggested solutions have already been investigated or developed (FEC, clock distribution), 
+while others need further verification (RSTP, cut-through forwarding). 
+The suggested solutions make it possible to fulfill the requirements set by CERN and GSI. 
+However, their cost might prompt re-examination and re-justification of at least two of these requirements: 
+the upper-bound latency required by GSI and the number of CMs lost per year.
+The former requires additional development effort to achieve the required 100~$\mu$s. 
+The latter requires a high level of network redundancy (triple or more), which is very costly. 
+Since the network topology and its reliability calculations turned out to be the largest factor in 
+the overall system reliability, more precise calculations and 
+simulations are needed to verify the rough estimates. These might involve different techniques (e.g. Monte Carlo simulations) 
+as well as more realistic use cases (in particular the network layout suggested in 
+\cite{biblio:CERNwrControlAndTiming}, which was not available at the time of the described study). 
+\modified{In particular, the calculations need to take into account the fact that 
+not all the nodes connected to the WRN are equally critical in real-life applications.}
--- a/papers/ICALEPCS2011_PEER_REVIEWED/introduction.tex
+++ b/papers/ICALEPCS2011_PEER_REVIEWED/introduction.tex
+\section{Introduction}
+
+The WR project is a multi-laboratory, 
+multi-company, international effort to create a universal fieldbus for control and timing systems 
+to be used at CERN, GSI and possibly other such facilities. The rationale behind WR, 
+the choice of technologies and the technical details of its operation have already been 
+described in a number of papers \cite{biblio:WRproject}, \cite{biblio:TomekMSc}, 
+\cite{biblio:WRPTP}. 
+%, \cite{biblio:ISPCS2011}. 
+Resilience and robustness are among the key features of any fieldbus. 
+This article presents a study of the reliability of a White Rabbit Network (WRN), 
+assuming basic knowledge of WR. 
+
+Reliability is defined as the ability of a system to provide its services to clients under both 
+routine and abnormal circumstances. It can be estimated by calculating the probability of 
+the system's failure ($P_f$). 
+% \begin{equation}
+%   \label{eq:reliability}
+%   R =1 - P_f
+% \end{equation}
+The lower the probability of WRN failure, the higher its reliability. In this article we first 
+identify the critical services of a WRN based on a study of WR's requirements. 
+Then, we analyze each critical service to identify possible 
+reasons for its failure and propose targeted countermeasures to increase reliability. 
+Finally, their impact on the overall system reliability is studied to 
+identify the largest contributor and the focus for further studies.
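+
+As a minimal illustration of this definition (a sketch under stated assumptions, not the calculation 
+performed in the study), reliability $R$ can be written as the complement of the failure probability. 
+If, in addition, a service is assumed to be backed by $n$ redundant elements that fail independently, 
+each with a hypothetical failure probability $p$, the service fails only when all of them fail: 
+\begin{equation}
+  R = 1 - P_f, \qquad P_f = p^{n}.
+\end{equation}
+For example, with the purely illustrative values $p = 10^{-2}$ and $n = 3$, this gives 
+$P_f = 10^{-6}$ and $R = 1 - 10^{-6}$. The symbols $R$, $p$ and $n$ and the independence 
+assumption are introduced here only for illustration.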
+