Identification of thresholds for accuracy comparisons of heart rate and respiratory rate in neonates

Background: Heart rate (HR) and respiratory rate (RR) can be challenging to measure accurately and reliably in neonates. The introduction of innovative, non-invasive measurement technologies suitable for resource-constrained settings is limited by the lack of appropriate clinical thresholds for accuracy comparison studies.
Methods: We collected measurements of photoplethysmography-recorded HR and capnography-recorded exhaled carbon dioxide across multiple 60-second epochs (observations) in enrolled neonates admitted to the neonatal care unit at Aga Khan University Hospital in Nairobi, Kenya. Trained study nurses manually recorded HR, and the study team manually counted individual breaths from capnograms. For comparison, HR and RR were also measured using an automated signal detection algorithm. Clinical measurements were analyzed for repeatability.
Results: A total of 297 epochs across 35 neonates were recorded. Manual HR showed a bias of -2.4 (-1.8%) and a spread between the 95% limits of agreement (LOA) of 40.3 (29.6%) compared to the algorithm-derived median HR. Manual RR showed a bias of -3.2 (-6.6%) and a spread between the 95% LOA of 17.9 (37.3%) compared to the algorithm-derived median RR, and a bias of -0.5 (-1.1%) and a spread between the 95% LOA of 4.4 (9.1%) compared to the algorithm-derived RR count. Manual HR and RR showed repeatability of 0.6 (interquartile range (IQR) 0.5-0.7) and 0.7 (IQR 0.5-0.8), respectively.
Conclusions: Appropriate clinical thresholds should be selected a priori when performing accuracy comparisons for HR and RR. Automated measurement technologies typically use a smoothing or averaging filter, which significantly impacts accuracy. A wider spread between the LOA, as much as 30%, should be considered to account for the observed physiological nuances, within- and between-neonate variability, and different averaging methods. Wider adoption of thresholds by data standards organizations and technology developers and manufacturers will increase the robustness of clinical comparison studies.


Introduction
There is a high risk of mortality during the neonatal period, particularly in resource-constrained settings 1 . Continuous monitoring of neonatal vital signs enables early detection of physiological deterioration and potential opportunities for lifesaving interventions 2-4 . The development of new, innovative, non-invasive, multiparameter continuous physiological monitors specifically for neonates offers the promise of improving clinical outcomes in this vulnerable population. However, before use, these technologies should be tested in real-world situations to determine accuracy and clinical feasibility.
A neonate's marked physiological variability, small size, and often fragile condition can pose challenges when measuring and monitoring vital signs. A lack of neonatal clinical validation standards further undermines the development of continuous monitors clinically validated specifically for neonates. Determining the accuracy of new continuous monitors is an essential step in bringing these technologies to market 5,6 .
The Evaluation of Technologies for Neonates in Africa (ETNA) platform aims to independently establish the accuracy and feasibility of novel continuous monitors suitable for use in neonates in resource-constrained settings 7 . To determine accuracy and agreement, new technologies are compared against existing reference methods or technologies 8 . Before the comparison process can proceed, a clinical reference verification step is necessary to determine appropriate accuracy thresholds 7 . These a priori thresholds determine the target level of agreement required and thus the success or failure of an investigational technology. This study describes the verification processes we conducted with a clinical reference technology in order to determine appropriate heart rate (HR) and respiratory rate (RR) accuracy thresholds for use in subsequent accuracy comparisons of new continuous monitors.

Study design
This cross-sectional study aimed to characterize the natural variation in neonatal HR and RR in order to identify appropriate accuracy thresholds for use in an accuracy comparison of continuous monitors.

Setting and participants
Study participants were neonates admitted for observation and care in the maternity ward, neonatal intensive care, and the neonatal high dependency units at Aga Khan University Hospital in Nairobi, Kenya (AKUHN). Between June and August 2019, caregivers were approached, recruited, and sequentially screened for enrolment by trained study staff during routine newborn intake procedures. To minimize potential selection bias, all caregivers were approached sequentially, as far as possible, and introduced to the study using a standardized recruitment script. Final eligibility determination was dependent on medical history results, physical examination, an appropriate understanding of the study by the caregiver, and completion of the written informed consent process (Table 1).

Study procedures
The Masimo Rad-97 Pulse CO-Oximeter® with NomoLine Capnography (Masimo Corporation, Irvine, CA, USA) was selected as the reference technology based on validated oxygen saturation (SpO2) accuracy measurement in neonates 9-11 . During study participation, trained and experienced study nurses attached the Rad-97 to neonates and conducted manual HR measurements (counting over 60-second epochs) every 10 minutes for the first hour and once per hour of participation thereafter, following World Health Organization (WHO) guidance for HR measurement in neonates 12 . Photoplethysmographic HR was also measured via the Masimo Rad-97 pulse oximetry skin sensor attached to the neonate's foot. RR was measured by capnography using an infant/pediatric nasal cannula to collect the neonate's exhaled carbon dioxide (CO2) levels. The duration of data collection was set at a minimum of one hour, with no upper limit. Neonates exited the study upon discharge from the ward or at the caregiver's request.

Data collection and analysis
Using a custom Android (Google, Mountain View, CA, USA) application, raw data were collected from the Masimo Rad-97 in real-time through a universal serial bus (USB) asynchronous connection and parsed in C (Dennis Ritchie & Bell Labs, USA). Instantaneous HR was obtained from the timing of the pulse oximetry signal quality index (PO-SQI). The plethysmogram waveform was sampled at 62.5 Hz with the PO-SQI identified by the Masimo Rad-97 at the peak of each heartbeat. The CO2 waveform was sampled at approximately 20 Hz from the capnography channel. The parsed output included an accurate time stamp for each entry in the waveform data output to facilitate synchronization and analysis. Data were recorded and stored on a secure AKUHN-hosted REDCap server 13 .

Amendments from Version 1
Based on helpful feedback from external reviewers, we have updated our manuscript to clarify the methods we used to synchronize the heart rate and respiratory rate data, along with the aims of the study and an updated Figure 2 to include 95% confidence intervals for the upper and lower limits of agreement.
Any further responses from the reviewers can be found at the end of the article.

We analyzed the CO2 waveform data using a breath detection algorithm developed in MATLAB (MathWorks, USA) and based on adaptive pulse segmentation 14 . In addition to providing a RR, the algorithm analyzed the waveform's shape and identified the breath duration (waveform trough to trough) for each breath. From the breath duration, we calculated a RR based on the median breath duration within the epoch. We developed a custom capnography quality score (CO2-SQI) based on capnography features to assist with data selection. HR and RR counts and medians, along with signal quality metrics from the MATLAB signal detection algorithm, were analyzed using R version 4.0.3 15 . Capnogram waveforms were generated with two seconds added at the beginning and end of each epoch to facilitate manual breath counting within the epoch.
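As an illustration of why a count-based RR and a median-based RR can diverge within the same epoch, the following minimal Python sketch computes both from a list of trough-to-trough breath durations. The function names and the example durations are hypothetical and are not taken from the study's MATLAB algorithm.

```python
from statistics import median


def rr_from_count(durations, epoch_len=60.0):
    """Breaths per minute from the number of breaths detected in the epoch."""
    return len(durations) * 60.0 / epoch_len


def rr_from_median_duration(durations):
    """Breaths per minute from the median trough-to-trough breath duration."""
    return 60.0 / median(durations)


# Hypothetical 60-second epoch: 32 quick breaths (1.25 s each)
# followed by 4 slow breaths (5 s each), for 60 s in total.
durations = [1.25] * 32 + [5.0] * 4

print(rr_from_count(durations))            # count-based RR
print(rr_from_median_duration(durations))  # median-based RR
```

In this made-up epoch the median-based estimate reflects the dominant fast breathing pattern, while the count-based estimate is pulled down by the few long breaths, so irregular breathing alone can produce a difference between the two averaging methods.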
To ensure temporal alignment between measurements, HR and RR epochs were synchronized across source data devices. For HR, alignment was done using a timestamp in REDCap that was set by the study nurse as HR counting was initiated. Before analysis, this timestamp was synchronized with the same timestamp in the custom Android application. Both the REDCap and Android servers were connected via the internet to a Network Time Protocol (NTP) server. Alignment of RR epochs was based on the Android application timestamp. All RR waveforms were compared visually to further ensure epoch synchronization.
One of the authors (JMA, a pediatric anesthesiologist) reviewed the capnogram tracings and discarded plots with marked variability or a significant duration of an artifact that would have made breaths challenging to count. The remaining plots were provided to two trained observers to independently count all breaths within each epoch using a set of predefined rules created by the investigators (Table 2). The two independent counts were averaged, and if the number of breaths counted by the two observers varied by more than three breaths per epoch, a third trained observer independently counted the plot, and the two closest counts were averaged.
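The adjudication rule for the manual counts can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, and the tie-breaking choice (when the third count is equally far from both original counts, the lower pair is averaged) is an assumption that the text does not specify.

```python
def consensus_count(count_a, count_b, third_count=None, max_diff=3):
    """Combine independent breath counts for one epoch.

    If the two observers differ by at most `max_diff` breaths, return their
    mean. Otherwise a third count is required and the two closest of the
    three counts are averaged.
    """
    if abs(count_a - count_b) <= max_diff:
        return (count_a + count_b) / 2
    if third_count is None:
        raise ValueError("counts differ by more than max_diff; third count needed")
    low, mid, high = sorted([count_a, count_b, third_count])
    # Assumption: on a tie between the two gaps, the lower pair is used.
    pair = (low, mid) if (mid - low) <= (high - mid) else (mid, high)
    return sum(pair) / 2


print(consensus_count(46, 48))      # observers agree within three breaths
print(consensus_count(40, 46, 45))  # third observer resolves the disagreement
```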
PO-SQI
Automated indicator of signal quality from the plethysmographic recording

CO2-SQI
Algorithm-defined indicator of signal quality from the capnography channel

Accuracy
The closeness of a measured value to the true value

Repeatability
The closeness of the results of successive measurements of the same measure

Agreement (between measures)
The consistency between two sets of measurements

Accuracy Threshold
A pre-specified value used to determine if a set of measurements has achieved a sufficient accuracy when compared with a reference value

Precision
The closeness of measurements to each other

Measurement repeatability was estimated using linear mixed-effects models based on the between- and within-neonate variability for each data source using R version 4.0.3 16 . Agreement between data collection methods was assessed using the method described by Bland and Altman for replicated observations and reported as a mean bias with 95% confidence intervals (CIs), 95% upper and lower limits of agreement (LOA), and as a root mean square deviation (RMSD) 17 . The aim was to identify practical threshold limits using data from the clinical reference technology verification process.
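To make the reported agreement quantities concrete, here is a minimal Python sketch of the bias, 95% LOA, LOA spread, and RMSD for paired measurements. It is a simplified version that treats every paired epoch as independent; the replicated-observations method referenced above additionally adjusts the variance for repeated measurements within each neonate, which this sketch does not do. The function name and the example values are hypothetical.

```python
import math
from statistics import mean, stdev


def bland_altman(test, reference):
    """Simple Bland-Altman summary for paired measurements.

    Assumes independent pairs; no adjustment for repeated measures.
    """
    diffs = [t - r for t, r in zip(test, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)                      # sample SD of the differences
    lower = bias - 1.96 * sd               # lower 95% limit of agreement
    upper = bias + 1.96 * sd               # upper 95% limit of agreement
    rmsd = math.sqrt(mean([d * d for d in diffs]))
    return {
        "bias": bias,
        "loa_lower": lower,
        "loa_upper": upper,
        "loa_spread": upper - lower,
        "rmsd": rmsd,
    }


# Hypothetical manual vs. algorithm-derived HR values (bpm) for four epochs.
manual = [130, 140, 150, 120]
algorithm = [132, 138, 155, 121]
for name, value in bland_altman(manual, algorithm).items():
    print(f"{name}: {value:.2f}")
```

Dividing the spread by the mean of the measurements gives the percentage spread used when expressing a threshold such as 30%.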

Sample size
We estimated that 20 neonates with ten replications each would give a 95% CI LOA between two methods of +/-0.76 times the standard deviation (SD) of their differences. Sample size estimates for method comparison studies typically depend on the CI required around the LOA, and sample sizes of 100 to 200 provide tight CIs 17 . We aimed for a sample size of at least 30 neonates to ensure a diverse population and sufficient replications for tight CIs.
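The ±0.76 figure is consistent with the common approximation that the standard error of a limit of agreement is about sqrt(3/n) times the SD of the differences. The sketch below applies that approximation; that the authors used exactly this formula is an assumption, and the calculation ignores the additional information from replicated observations. The function name is hypothetical.

```python
import math


def loa_ci_half_width(n, z=1.96):
    """Approximate 95% CI half-width of a limit of agreement.

    Expressed in units of the SD of the differences, using the
    approximation SE(LOA) ~= sqrt(3 / n) * SD.
    """
    return z * math.sqrt(3.0 / n)


for n in (20, 100, 200):
    print(n, round(loa_ci_half_width(n), 2))
```

Under this approximation the half-width shrinks with the square root of the sample size, which is why samples of 100 to 200 give markedly tighter intervals than a sample of 20.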

Results
Between June and August 2019, 35 neonates were enrolled, and 297 clinical observations were completed with a mean of 8.4 (SD 1.7) observations per neonate (Table 3; Figure 1) and a median data collection time of 4 hours, 5 minutes (interquartile range (IQR) 3:52-4:45) 18 . The manual HR measurements were found to have a non-normal distribution with skewness of 0.76 and kurtosis of 3.60 (p<0.001). The median manual HR measurement for all observations was 134 (IQR 126-143) beats per minute (bpm).
Moderate repeatability was demonstrated with approximately 62% (95% CI 47%-73%) of the manual HR variability being due to differences between neonates (Table 5, Figure 3A). Since the 95% CI for manual HR crossed 50%, the between- and within-neonate variability appeared to be comparable, with neither causing significantly more variability than the other.
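Repeatability in this sense is the share of total variance attributable to differences between neonates. The study estimated it with linear mixed-effects models in R; the sketch below instead uses the classical one-way ANOVA estimator for balanced data, a simpler stand-in chosen only to show the idea. The function name and the example values are hypothetical.

```python
from statistics import mean


def repeatability_oneway(groups):
    """Proportion of variance between subjects (one-way ANOVA estimator).

    `groups` is a list of per-neonate measurement lists, all of the same
    length k (balanced design).
    """
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("balanced data with at least two replicates required")
    n = len(groups)
    grand = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    # Between-neonate and within-neonate mean squares.
    msb = k * sum((m - grand) ** 2 for m in group_means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical HR values (bpm): three neonates, two epochs each.
print(round(repeatability_oneway([[120, 124], [140, 144], [130, 134]]), 3))
```

A value near 1 means most variability comes from differences between neonates, while a value near 0.5 means between- and within-neonate variability are of similar size.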
Manual RR measurements from capnograms were found to have a non-normal distribution with skewness of 0.61 and kurtosis of 2.96 (p=0.027). The median manual RR measurement for all observations was 47 (IQR 39-56) breaths per minute. The manual RR compared to the algorithm-derived median RR showed a negative bias of -3.2 (-6.6%) and a marked spread between the 95% LOA of 17.9 (37.3%). The RMSD was 5.5 (11.4%). Comparing the manual RR to the algorithm-derived RR count showed a smaller bias of -0.5 (-1.1%) and a tighter spread between the 95% LOA of 4.4 (9.1%). The RMSD was 1.2 (2.5%).
The repeatability was moderate with approximately 66% (95% CI 47%-79%) of the manual RR variability due to differences between neonates (Table 5, Figure 3C). Since the 95% CI crossed 50%, the amount of between- and within-neonate variability appeared similar, with neither one resulting in significantly more variability than the other.

Table 2. Predefined rules for manual breath counting from capnograms
1. Count peaks of the waveform that are within the white background. Ignore peaks that are within the grey background on either side of the image.
2. A peak should be counted as a breath when the peak of the waveform is above 15 mmHg, the lower horizontal blue line.
3. If the peak does not reach the lower horizontal blue line at 15 mmHg, to be counted as a breath, the peak should reach at least 50% of the mean peak.
4. The waveform should dip down to the normal baseline (either below 15 mmHg, the lower horizontal blue line, or based on other breaths). If the waveform does not reach below this point, then this is considered part of the same (double) peak and only counted as a breath once.
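The manual counting rules listed in Table 2 can be approximated in code. The sketch below is a simplified reading of rules 1, 2, and 4 only: it counts each excursion above the 15 mmHg line once (so a double peak that never returns to baseline is a single breath) and keeps only excursions whose highest point falls inside the epoch. Rule 3 (low-amplitude peaks judged against the mean peak) is omitted, and a sample exactly at the threshold is treated as baseline, which is an assumption. It is not the study's breath detection algorithm, and the function name and example waveform are hypothetical.

```python
def count_breaths(times, co2, epoch_start, epoch_end, threshold=15.0):
    """Count breaths in a capnogram using a simple threshold rule."""
    peak_times = []
    in_breath = False
    best_value = None
    best_time = None
    for t, value in zip(times, co2):
        if value > threshold:
            # Track the highest sample of the current excursion.
            if not in_breath or value > best_value:
                best_value, best_time = value, t
            in_breath = True
        elif in_breath:
            # Waveform returned to baseline: the excursion is complete.
            peak_times.append(best_time)
            in_breath = False
    if in_breath:
        peak_times.append(best_time)
    # Rule 1: keep only peaks inside the epoch (the "white background").
    return sum(epoch_start <= t < epoch_end for t in peak_times)


# Hypothetical waveform sampled once per second (mmHg).
times = list(range(14))
co2 = [0, 30, 35, 0, 0, 30, 20, 32, 0, 0, 10, 0, 40, 0]
print(count_breaths(times, co2, epoch_start=0, epoch_end=14))
```

In the example, the excursion with values 30, 20, 32 never drops below the threshold and is therefore counted as one breath, and the small bump of 10 mmHg is ignored.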

Discussion
This reference technology clinical verification study showed minimal measurement bias with a wide spread of 95% upper and lower LOAs and similar repeatability compared with manual clinical measurements. The agreement results allowed us to identify practical HR and RR thresholds for our subsequent technology comparison evaluation. Specifically, we identified a 30% spread between the 95% upper and lower LOA. These a priori-defined thresholds were based on variability observed ten and sixty minutes apart in the same neonate and considered the natural within-neonate physiologic variability. Variability was found to be more marked in some neonates. In part, the 30% spread between 95% upper and lower LOA was selected based on the idea that thresholds should not be more stringent than the observed physiological variability, and in part, based on results from the different averaging methods (manual RR vs algorithm-derived median RR). Given the large difference in results between the two averaging methods, considerable thought should be given prior to choosing an averaging method. A random selection of real clinical data can provide appropriate guidance for selecting suitable neonatal accuracy thresholds.
Of note, one neonate (PTID9) significantly impacted the LOA for HR. Five of nine of this neonate's manual HR measurements significantly diverged from the same epoch's PO-SQI HR values and were significantly lower than their mean PO-SQI HR, despite having acceptable signal quality scores. This irregularity suggests an HR reading or data entry error by the study nurse. Removing this neonate's data and re-analyzing resulted in a smaller bias and tighter LOAs (Figure 2B).
Results from this clinical verification highlight the difficulty with existing performance thresholds. Current United States Food and Drug Administration performance thresholds for HR measurement, based on electrocardiogram measurements, may not be applicable for use in neonates or when using photoplethysmography for estimating HR 19 . The current UNICEF target product profile for RR measurement technology recommends a ±2 breaths per minute threshold, which may be too stringent even for use in adults 20,21 . Using a ±2 breaths per minute recommendation with our RR data would result in a LOA spread threshold of no more than 5%, which is half the LOA spread of our best performing RR comparison.
Furthermore, a ±2 breaths per minute or 5% spread in LOA is smaller than random and natural within-neonate physiologic variability (11.5% in this study [unpublished data]) and would result in unrealistically stringent thresholds.
Selecting a performance threshold is challenging. The threshold cannot be too restrictive or inflexible, thereby stifling innovation and preventing new single or multi-parameter continuous monitors from reaching the market. However, too lax a threshold could result in an inaccurate representation of the underlying physiological state. One key limitation is that the true underlying HR or RR is unknown, regardless of the measurement method 6,22 . The primary goal of this reference technology verification study was to establish a priori thresholds as the first step of our technology comparison evaluation while at the same time understanding that the true underlying RR and HR cannot be known and also recognizing the marked physiologic variability between and within neonates.
In this study, we did not attempt to define or detect clinically meaningful events; instead, we focused on describing non-random thresholds that fall outside of normal physiological variability. We defined HR and RR thresholds based on the difference between the 95% upper and lower LOA. Additional studies will be required to determine if these thresholds translate into improved clinical outcomes.
Performance thresholds identified using this method are influenced by the characteristics of the neonates studied, the data selection methods, and the number of comparisons. For this reason, the thresholds we identified may not be applicable in different neonate cohorts, such as those receiving mechanical ventilation or immediately following birth, among others. Variability will be influenced by disturbances in the environment such as routine procedures, feeding, noise, and time of day. To minimize variability in our data set, we used only RR epochs that appeared to be regular based on visual inspection. Although these segments were selected based on predefined criteria, a majority (167/297) were discarded as the extreme variability seen in some recordings would have made reproducible manual counting of breaths impossible. We have previously demonstrated acceptable agreement between electrocardiogram (ECG)-derived HR variability (HRV) and photoplethysmography (PPG)-derived HRV in children when the PPG is sampled at an appropriate rate. This should be validated in neonates using an ECG 23 .

Conclusion
Appropriate clinical thresholds should be selected a priori when performing accuracy comparisons for HR and RR. The magnitude and importance of sample size, as well as within-neonate variability, require further investigation. A larger sample size could allow the development of an error model that more clearly describes the error due to various factors such as the measurement technology, averaging method, the observer, and the natural variability of neonatal HR and RR. We strongly support the creation of international standards for technology comparison studies in neonates. These standards should include thresholds for HR and RR based on the specific neonatal population studied and provide details of the experimental conditions, data selection methods, and analysis methods used. Together, such standards would lay the groundwork for a robust continuous monitor comparison field.

Figure 3. Variability plots (vertical for between-neonate variability, horizontal for within-neonate variability). Manual heart rate (HR) between-neonate variability accounts for 62% of total variability (A); median pulse oximetry signal quality index (PO-SQI) HR between-neonate variability accounts for 75% of total variability (B); manual respiratory rate (RR) between-neonate variability accounts for 66% of total variability (C); algorithm-derived median RR between-neonate variability accounts for 50% of total variability (D); and algorithm-derived RR count between-neonate variability accounts for 66% of total variability (E).

Department of Biomedical Engineering, McGill University, Montreal, QC, Canada
In this work the authors compare manual measures of heart rate and respiratory rate with measures obtained through the automated analysis of pulse oximetry and capnograph signals. The stated objective is to determine the threshold to be used in evaluating the accuracy of new continuous monitors. The major finding was that the different methods had low bias but large random errors. Given this finding it is not clear how the results of this paper contribute to its stated objectives. Indeed, the authors conclude that appropriate clinical thresholds should be selected a priori. I have several concerns with the paper as it stands, mostly related to the methods:
1. I had difficulty understanding exactly what measures were being compared. As I understand it, for heart rate the measures were: (1) the average heart rate over a 60-second period, computed as the reciprocal of the number of heart beats counted manually over the period; and (2) the median of the reciprocal of the beat-to-beat interval computed by the Masimo device. For respiratory rate the two measures were: (1) the reciprocal of the number of breaths in the period determined by manual analysis of the capnograph; and (2) the reciprocal of the median breath duration computed over the same period using a computer algorithm. If this is correct, then the authors are not actually comparing measures of heart rate and respiratory rate but rather measures of their average values computed over a one-minute period. A difficulty with this is that this provides no measure of the accuracy of measures of either HR or RR variability. The authors need to make it clear exactly what they were comparing, justify their choice, and explain why they used a one-minute period.
2. The authors use confidence intervals and limits of agreement derived from the Bland Altman plots to assess the differences between measures. My understanding is that the validity of these measures depends on the assumption that the differences between measures are normally distributed. Did the authors validate this assumption?
3. The automatic analysis of the capnograph was done using an algorithm developed by the authors and described in a conference paper. There is no discussion of how this algorithm was validated or what its expected accuracy was.
4. One of the authors' conclusions is that a wider spread between the LOA values should be allowed to account for intra- and inter-neonate variability. It is not clear to me why this should be.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly

If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
I think this work builds on a number of previous studies, and as there are so many groups working on RR at the moment I think it would have been useful to situate this work in relation to the previous studies and articles (listed below) and discussions that have taken place. I think that showing that these results are similar strengthens the arguments around what the a priori thresholds should be. Again, these were discussed at a UNICEF meeting in 2019 and this could be referenced also, to show that these findings match global discussions.
In measuring RR we know that movement has a huge impact on the variability of RR; this is not well described in the piece. The authors mention "To minimize variability in our data set, we used only RR epochs that appeared to be regular based on visual inspection. Although these segments were selected based on predefined criteria, a majority (167/297) were discarded as the extreme variability seen in some recordings would have made reproducible manual counting of breaths impossible". Does this mean you removed RR epochs or instances where there was a lot of movement? While I agree that it is good to reduce variability, I would be concerned that by removing the highly variable epochs the authors are not reflecting a true RR. Apologies if I misunderstood the methods here.
Articles to reference in the background:

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 10 Nov 2021
Jesse Coleman, Aga Khan University Hospital, Nairobi, Kenya
Dear Dr. Drummond,
Thank you for providing the opportunity to respond to your query on our updated version of the manuscript titled "Identification of thresholds for accuracy comparisons of heart rate and respiratory rate in neonates" at Gates Open Access. We appreciate your ongoing effort to review the manuscript and are grateful for your further comment. Below is our response to your query:
Query: If respiratory rate was measured from exactly the same epoch, and standard criteria were used for defining a breath, is it necessarily also true that exactly the same breaths were counted? I suggest this is not the case, and this is for the reason the authors imply, but do not elucidate: the monitor-derived value is calculated from "past values" and displayed in the "present". The observer based measure is then taken from events that happen from that time forward. I would suggest a simple test of the monitors, using a simulated sample of waveforms, whose duration could be abruptly changed, say from 2 seconds to 2.5 seconds, would substantially elucidate the capacity of the device to reflect the "real now" signal that the observer has been set to observe. A diagram to show the relative times of what is measured by a monitor, and an observer, relative to the "reference mark" time, would be helpful for the reader to grasp these concepts.
Response: Thank you for highlighting this tricky aspect of breath identification and respiratory rate comparison. We were able to match the exact breath. Our team had access to the raw (instantaneous) CO2 waveform data, recorded at approximately 20 Hz. We only analyzed the raw waveform data and counted the number of breaths. Furthermore, each 60-second epoch was isolated; no data from before or after the epoch was included in any calculation or analysis. Rather, individual breaths were counted using two different breath counting methods: (1) manual breath counting from capnograms by study team members using standardized breath identification rules, and (2) algorithm-derived breath identification developed in MATLAB. Your suggestion about simulated respiratory rate comparisons may be an alternative if we did not have this very precise breath identification or if some method of filtering or averaging of RR was used.
The exact hypothesis of this study is hard to discern. In their abstract and introduction, the authors imply that innovative, non-invasive measurement technologies that use advanced measures of vital signs such as heart rate variation and transient deceleration (citation 2) can be used to improve outcome in infants in resource-constrained settings such as low and middle income countries, but the paper then describes a comparison of nurse observation with continuous measures available from electronic monitors, with the stated aim of defining the accuracy of methods to continuously measure physiological events. Such comparisons have been done, and they cite a substantial review (citation 4).
The introduction then ends with this statement of the study aim: "the clinical reference technology verification processes conducted to determine appropriate heart rate (HR) and respiratory rate (RR) thresholds in subsequent accuracy comparisons." However the methods then state the aim is "to identify the natural variation in neonatal HR and RR in order to identify appropriate accuracy thresholds for use in an accuracy comparison of MCPM technologies." So, we have at least three alternative study aims: the third I'd consider to be the most useful aim, comparing MCPM methods: unlikely to be answered when comparing clinicians with monitors, but could be answered with the data gathered.
At this point, I felt that some sensible and more exact definitions are required, for words such as accuracy, repeatability, agreement, threshold, precision perhaps -as stated in citation 6, by two of the authors of the present paper.
What is "Repeatability"? If we accept that the result of a 60 second counting period will differ, from one observation to the next, because the components of the measure (the duration of each breath, or the interval between photoplethysmograph pulse waves) are randomly different, then the only mechanism available to improve the estimate of the overall frequency is to increase the size of the sample: this is the law of large numbers, a statistical rule that has been known for several centuries in one form or another.
Bland and Altman, when first introducing their extremely popular method, used an example of spirometry: a single measure made first with one device and then with an alternative device. It's quite possible that two repeat FVC manoeuvres with the same device would differ: within subject variation. This is a more substantial problem in this study, as the authors state: "Furthermore, a ±2 breaths per minute or 5% spread in LOA is smaller than random and natural within-neonate physiologic variability (11.5% in this study [unpublished data]) and would result in unrealistically stringent thresholds". The degree of within subject variation is evident also from So we have a small number of intrinsically variable events. So, for a fair comparison of two methods, a necessary requirement is to ensure that the events being measured are the same, exactly the same sample has been taken. If the pulse-wave derived rate from the machine is of a different series of waves (i.e. the time period is not EXACTLY the same) than those counted by the nurse, they are already going to be affected by within subject variation as well as the variation between the methods. The methods state that manual measures were "every 10 minutes for the first hour and once per hour of participation thereafter": were the manual and monitor measures exactly timed to coincide? And, was there any time trend in the patients studied for longer times?
Of course, Bland and Altman had to subsequently refine their method, to separately account for repeated measures in multiple subjects, and at the same time they introduced the concept of confidence intervals for the limits of agreement. Looking at figure 3, there's a lot of variation: it would be helpful to plot the CI for the LOA on the Bland and Altman plots. However, I would suggest that the most useful thing to do would be to carefully analyse repeated random samples from the electronic records, looking at precise time intervals, so that the intrinsic variation could be quantified, and study how different sample sizes might affect reliability of the rate values. We have done this for respiratory rate in acutely ill adults (Drummond et al., 2020 2 ). Using 30 second periods of observation gave an interquartile range of respiratory rates of 3.4 breath/minute, whereas samples taken for 120 seconds had an IQR of 2.5. Using the techniques the authors describe here, why not sample for 5 minutes?
Availability of these records would be very useful to other workers! More analysis of the monitor records is also important since it appears that rate is not, in itself, perhaps the most important signal. For example, others have found that short-term heart and respiratory rate variability make a significant contribution to illness scoring systems (Saria et al., 2010 3 ).
We apologize that we did not emphasize the importance in the original draft. We have updated our methods section to clearly describe the synchronization methods we used to ensure that all data was precisely temporally aligned. The new wording will read as follows: "To ensure temporal alignment between measurements, HR and RR epochs were synchronized across source data devices. For HR, alignment was done using a timestamp in REDCap that was set by the study nurse as HR counting was initiated. Before analysis, this timestamp was synchronized with the same timestamp in the custom Android application. Both the REDCap and Android servers were connected via the internet to a Network Time Protocol (NTP) server. Alignment of RR epochs was based on the Android application timestamp. All RR waveforms were compared visually to further ensure epoch synchronization." With the definition and clarification made, we feel that testing the repeatability and agreement of the two methods is reasonable.

Point 4: And, was there any time trend in the patients studied for longer times?
Response: Thank you for asking this question. We do have respiratory rate data on patients studied for longer times that has been submitted and is currently under review elsewhere.
Point 5: Of course, Bland and Altman had to subsequently refine their method, to separately account for repeated measures in multiple subjects, and at the same time they introduced the concept of confidence intervals for the limits of agreement. Looking at figure 3, there's a lot of variation: it would be helpful to plot the CI for the LOA on the Bland and Altman plots.
Response: Thank you for the helpful suggestion. We will be updating the Bland and Altman plots to include the CI for the LOA throughout.
Point 6: However, I would suggest that the most useful thing to do would be to carefully analyse measures of heart rate variability between ECG and plethysmography in children. This agreement is dependent on an appropriate sampling rate of the plethysmogram. Noting your comment, we will add the following text to the discussion: "Pulse plethysmography may not be an accurate measure of HR variability due to innate technology limitations. Future studies looking at HR variability should consider using ECG monitoring, despite having its own limitations. 1 "
Competing Interests: None