## 1 Introduction and Background

### 1.1 Spectroscopy techniques at European XFEL

Spectroscopy is central to the natural sciences and engineering as one of the primary methods to investigate the real world, study the laws of nature, discover new phenomena, and characterize the properties of substances or materials ().

However, experiments performed at photon science instruments in large facilities such as synchrotrons or XFELs generate large volumes of data. As an advanced source, European XFEL provides new research opportunities to scientists from domains as diverse as physics, chemistry, geo- and planetary sciences, materials sciences, and biology (). The unique feature of the European XFEL is the capability of the superconducting linac to accelerate trains of up to 2,700 electron bunches within one 600-µs-long RF pulse at 4.5 MHz () with an electron energy of up to 17.5 GeV. X-ray pulses can therefore be generated with only 220 ns separation in time, at a maximum of 27,000 pulses per second. When these pulses are combined with fast 2D detectors, a huge amount of data is generated in a short time ().
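The timing figures quoted above are related by simple arithmetic, sketched below. The value of 10 RF pulses (trains) per second is an assumption that is not stated in the text; the other inputs are the numbers quoted in the paragraph.

```python
# Values quoted in the text
rep_rate_hz = 4.5e6          # intra-train repetition rate (4.5 MHz)
bunches_per_train = 2700     # maximum number of bunches per RF pulse

# Assumption (not stated in the text): 10 RF pulses (trains) per second
trains_per_second = 10

spacing_ns = 1e9 / rep_rate_hz                             # time between pulses
train_length_us = bunches_per_train / rep_rate_hz * 1e6    # duration of one train
pulses_per_second = bunches_per_train * trains_per_second  # maximum pulse rate

print(f"pulse spacing: {spacing_ns:.1f} ns")
print(f"train length:  {train_length_us:.0f} us")
print(f"pulses/second: {pulses_per_second}")
```

The spacing evaluates to roughly 222 ns, which the text rounds to 220 ns.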

At the European XFEL, six scientific instruments (e.g. SCS (; ), FXE (; ), and HED ()) are currently in operation for specific applications using soft or hard X-ray techniques (). Among them, the High Energy Density (HED) scientific instrument focuses on the investigation of matter at high density, temperature, or pressure, or in high electric and/or magnetic fields (). It offers a wide range of time-resolved X-ray techniques, ranging from diffraction and imaging to different spectroscopy techniques, for measuring various geometric, electronic, and magnetic structural properties ().

In HED experiments with high-density materials, changes in pressure cause changes in the measured spectral peaks: peaks may newly appear, vanish, shift, or split. To evaluate the state of the experiment, the measured spectra need to be classified so that each class is assigned to a different state of the system under investigation. The two major spectral changes that we aim to capture in this study are

- the change of the intensity distribution (e.g. drop or appearance) of peaks at certain locations, and
- the shift of peak positions in the spectrum.

The complexity of the experimental setup and the high data rates, on the order of 10–15 GB per second per detector at European XFEL (), demand an efficient approach that combines experiments with online data analysis as an active feedback system with low latency (; ; ; ).

Under such circumstances, it is crucial to properly use and reuse previously collected data, and to efficiently analyze them and accurately extract meaningful features during experiments. While data analysis algorithms that properly solve the relevant physics equations and optimization steps are too slow to run in near real time, machine learning based algorithms can provide a solution.

### 1.2 Machine Learning and its requirements

Machine learning (ML) is a subfield of artificial intelligence centered on algorithms that can learn to complete tasks using data (). It applies broadly to a collection of computational algorithms and techniques that train systems from raw data rather than a priori models (), and is thus useful for research facilities that produce large, multidimensional datasets, like European XFEL. With recent developments in ML, data-driven deep learning methods have turned out to be very good at discovering patterns and structures in high-dimensional data () during the training process. The trained models can then supply computationally inexpensive decisions when applied to new data. Hence, this technique opens up new ways for data-driven analysis in spectroscopy by offering the possibility of recognizing specific features on the fly during data collection. It enables fast feedback and the implementation of near real-time downstream data analysis, e.g. providing immediate information on the phase of the experiment based on the classification of the currently measured spectra.

To train the model effectively and accurately, ML requires large amounts of data, which can only be supplied in a machine-readable way if the experimental datasets are stored in a standard and well-understandable format with clear definitions and annotations. Research outputs should align with the **FAIR** principles, meaning that data, software, models, and other outputs should be Findable, Accessible, Interoperable, and Reusable (). The key to finding and reusing digital assets is metadata (). For data reuse, annotating the data and assigning metadata with ontology definitions are very important. In the case of synchrotron or XFEL experiments, the data collection and management processes of the experiment control systems shall be prepared to acquire all necessary metadata in a hierarchical data structure following a scientific experiment data model. Such an experimental data model shall connect the instrumentation and data acquisition to the physics model behind the respective experiment. The NeXus data format and ontology platform (; ; ) provides an ideal environment for this. Whether the experiment is performed at a synchrotron or an XFEL facility, the corresponding ML algorithms could always obtain the information in the same machine-readable format, and could easily and automatically pick up the data source to be processed during training. This requires all experiment-related metadata, including those needed for processing as well as those needed for filtering the available datasets, to be captured and stored according to the ontology definitions and relationships set by the data model of the respective experiment type.

Unfortunately, in the case of XFEL sciences, the scientific data are not well annotated, and much key information is not machine-findable or accessible. Specifically, in our use case, the HED experimental spectral data we obtained lacked some key information, such as the pressure values, and were provided with wrong metadata, e.g. on the energy of the incident beam. Therefore, the selection and preparation of the training dataset for our ML model required a lot of manual work.

### 1.3 Scientific experiment data model

To comply with FAIR principles and facilitate ML training, experimental data shall follow experiment data models which connect the experimental data items to the physics model used by the design of the experiment and also by the corresponding data processing applications. As the physics models and data processing applications can be different from experiment to experiment, the corresponding data models need to be defined for each experiment type respectively. Similarly to physics models which can be extended and reused during the investigation of related physics phenomena, scientific experiment data models shall also be modular and shall reuse base definitions.

The NeXus glossary, defined using the NeXus Definition Language (NXDL), offers several base classes and allows annotating any data items stored in the NeXus format by connecting them to the glossary definitions. NXDL supports modular extension of these base classes and the declaration of application definitions. These form complete data models covering all the requirements of data processing applications for the respective experiment types. NXDL also provides an ontology which is based on the relationships declared between the defined glossary terms. Hence, scientists can define a schema for an application data model which connects every data field in a NeXus data file to its machine-readable meaning (). In this way, an experiment model can be expressed in an NXDL application definition, and all the corresponding data and metadata can be stored in a file following the NeXus data format standard.

A good example of an application definition that specifies a data model for a given scientific domain, and thus supports data reuse, is NXmx (NeXus for macromolecular crystallography), one of several standardized NeXus application definitions (). Recording data and metadata in such an annotated data exchange format can guarantee automatic and proper data interpretation and support data reuse.

### 1.4 Significance of the study

As described above, it is important to optimize the usability and reusability of the scientific data collected during XFEL experiments. A near real-time and accurate extraction of meaningful features during the experiments is vital for feedback systems, and essential for assessing the state of the experiments. The main contributions of this study are two-fold:

Firstly, we show an example of how ML-based artificial intelligence (AI) methods can provide near real-time feedback on the status of the experiments and trigger a conceptual breakthrough towards data-driven spectroscopy (). For this purpose, this study demonstrates a machine learning based statistical model for spectra classification. The presented solution automatically finds the regions (or bins) where the classes differ significantly in our training set (this is what we call separability). For this step, a two-layer neural network is trained as a classification model for each bin separately, and the obtained separability is used to calculate a classification weighting factor for each bin. By calculating the weighted sum of the ML-based predictions obtained for the points of a spectrum (corresponding to the diffractogram in a certain experimental state), a class label is finally assigned which suggests the actual state of the experiment. We also investigated the performance of the model using different numbers of bins, and found a relationship between the number of bins and the size of the ambiguous region in which the prediction of the model is not reliable. As a result, an optimal number of bins can be determined as a compromise between the computational complexity and the expected classification confidence.

Secondly, we highlight the importance of FAIR data management in XFEL experiments, and suggest the implementation of individual NeXus application definitions for the different experiment types. Specifically, for future HED spectral data, we recommend introducing a specific application definition which allows a machine-readable and machine-interpretable registration of important metadata, like the pressure and wavelength values alongside the raw detector data.

## 2 Experiment Data

Spectral data were collected during an HED experiment performed at the PETRA III Extreme Conditions Beamline P02.2 using the LAMBDA GaAs 2M detector system (a multi-megapixel hard X-ray detector for measuring the scattering intensity in synchrotron experiments), as described in (). A simple sketch of the experiment setup is shown in Figure 1. The detector supports frame rates of up to 2000 frames per second, i.e. a period of 0.5 ms per image. The sample used in this study was (Mg_{0.2}Fe_{0.8})O magnesiowüstite, a solid solution of the endmembers periclase (MgO) and wüstite (FeO). The sample was compressed from 1 to above 100 GPa over a period of 50 s while maintaining reasonable diffraction image quality in consecutive 100 ms exposures. The measurement was performed at 25.6 keV beam energy (corresponding to an incident beam wavelength of 0.4828 Å). Diffraction images (as in Figure 2) were continuously collected every 100 ms during a trapezoidal ramp profile, tracking the pressure evolution of the sample and the Pt (platinum) pressure calibrant. Azimuthal integration and baseline subtraction () were applied to the 2D raw diffraction images to obtain 1D integrated spectra. Figure 2 shows an example of how the peak positions in a spectral curve reflect the intensity distribution in the corresponding raw diffraction image as a function of the scattering angle. A peak in the diffractogram or spectrum corresponds to a diffraction ring ().

The spectral data collected during the experiment are shown in Figure 3. To show more clearly how the diffraction changes while the pressure on the sample is changing, only every 20th diffractogram is shown. Hence, the time interval between adjacent diffractograms in Figure 3 (a) is 2 s. As can be clearly seen from this figure, the amplitudes of the spectral peaks change (increase, decrease, or even vanish) at certain locations, and peaks also shift in their 2θ-angle position, split, or start to broaden.

The changes correspond to the modification of the crystal lattice (e.g. indicating phase changes). During the experiment, we should be able to follow these changes and determine the actual state of the system in near real-time. Scientifically, the most relevant question is whether the phase transition in the sample has already happened. To determine this using Machine Learning, we selected representative measured spectra (and also produced simulated spectra) at both the initial and final stages. Based on this input, we have to provide a judgment at each point during the experiment with the minimum ambiguity.

In this experiment, the most important external condition that causes the phase transition is the change of pressure. One of the raw data files produced during the HED experiment is shown in Figure 4. It recorded data from one module of the three-module detector system and stored it in HDF5 format. Note that NeXus base classes were used by the facility to store the data, but no application definition was followed which would define where to find certain experiment settings as metadata for this experiment type (e.g. pressure or incident X-ray beam energy). The root element, an NXentry group, represents one experimental scan or run (). Some instrument settings can also be retrieved through base class definitions such as NXdetector, for example the detector acquisition mode, calibration data, geometry, and translation information. However, much of the metadata, including the incident beam energy, appears only as free-text key-value pairs under an NXcollection group, without any definition or link to a physics model. As a consequence, no data processing application can rely on the generated file to automatically find these settings and use them. Instead, beamline scientists have to help with a manual interpretation of the dataset. Moreover, it turned out that the free-text key-value pair provided for human interpretation, “beam energy – 25000”, is misleading: the incident beam energy was actually 25.6 keV and not 25000 eV, as suggested by this unitless and apparently unmaintained and undocumented entry. Note that without knowing the beam energy, no conversion between diffraction angle and Q-space is possible, which prevents proper calibration and the determination of key information such as the pressure corresponding to each diffractogram. As a result, we could not select the training set for our ML model automatically, but instead picked the training spectra manually.
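To illustrate why the beam energy matters, the following minimal sketch converts a scattering angle 2θ into momentum transfer using the standard elastic-scattering relation Q = 4π sin(θ)/λ with λ = hc/E. The function names and the example angle of 10° are hypothetical and chosen only for illustration; this is not the calibration code used in the study.

```python
import math

HC_KEV_ANGSTROM = 12.398419843  # h*c in keV * Angstrom

def energy_to_wavelength(energy_kev):
    """Photon wavelength in Angstrom for a given photon energy in keV."""
    return HC_KEV_ANGSTROM / energy_kev

def two_theta_to_q(two_theta_deg, energy_kev):
    """Momentum transfer Q (1/Angstrom) for a scattering angle 2-theta in degrees."""
    theta = math.radians(two_theta_deg) / 2.0
    return 4.0 * math.pi * math.sin(theta) / energy_to_wavelength(energy_kev)

# Same detector angle, interpreted with the actual energy (25.6 keV)
# and with the energy suggested by the free-text entry (25000 eV).
q_actual = two_theta_to_q(10.0, 25.6)
q_from_metadata = two_theta_to_q(10.0, 25.0)
relative_error = (q_actual - q_from_metadata) / q_actual

print(f"Q with 25.6 keV: {q_actual:.4f} 1/A")
print(f"Q with 25.0 keV: {q_from_metadata:.4f} 1/A")
print(f"relative error:  {relative_error:.2%}")
```

Because Q scales linearly with the photon energy, the wrong metadata value shifts every peak position in Q-space by the same relative amount, which would propagate directly into any pressure derived from the calibrant peaks.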

## 3 Example for data reuse

Machine learning with its big data demand provides a perfect use case for data reuse. In ML, a classification algorithm is usually considered a type of supervised learning, which is used when the outputs are restricted to a limited set of categories or values.

In the case of supervised learning, the training process will tweak the parameters of the model for correcting the output towards the one defined as a good solution. It infers a function from a labeled training dataset (). The trained mathematical model can then be used as a classifier to predict the class of previously unseen instances. Hence, after training the model using spectral data belonging to either the state before or after the phase transition, the neural network can classify each newly acquired spectrum during data collection and determine the actual state of the system under investigation during the experiment.

### 3.1 Mathematical model applied for Machine Learning

#### A. Notation and terminology

In this study, scalar variables are denoted by normal-font letters (e.g. *x* or *X*), matrices by bold capitals (e.g. **X** or **W**), and vectors by bold lowercase letters (e.g. **x**). Variables with subscripts represent elements of a vector or matrix; for example, we use *x*_{i} and *x*_{ij} to indicate an element of the vector **x** and of the matrix **X**, respectively. The transpose of a matrix or vector is represented by the symbol (∙)^{T}, and (∙)^{(i)} is associated with the *i-th* data item in the training set.

#### B. Supervised learning for classification

The inputs **X** ∈ ℝ^{n×m}, representing *m* training samples (or observations) with *n* data points each, can be denoted as

$$\mathbf{X}=\left[\mathbf{x}^{(1)},\ \mathbf{x}^{(2)},\ \dots,\ \mathbf{x}^{(m)}\right]$$

The training samples are stacked as the columns of the matrix. Each sample has *n* data points or features and is often called a *feature vector*. The *i-th* training example can be denoted as

$$\mathbf{x}^{(i)}=\left[x_{1}^{(i)},\ x_{2}^{(i)},\ \dots,\ x_{n}^{(i)}\right]^{T}$$

The corresponding output or classification response **y** ∈ ℝ^{1×m} contains the class labels or target values for **X**, represented as

$$\mathbf{y}=\left[y^{(1)},\ y^{(2)},\ \dots,\ y^{(m)}\right]$$

where each element *y*^{(i)} in the output vector **y** corresponds to the input vector **x**^{(i)}. The input-output pairs (**x**^{(i)}, *y*^{(i)}) constitute the training data set *S*,

$$S=\left\{\left(\mathbf{x}^{(i)},\ y^{(i)}\right)\right\}_{i=1}^{m}$$

The function *f* that maps **X** to **y** is the model that the ML algorithm needs to learn to accurately predict the outcomes. In most cases, the observation data **X** contain noise and may also contain contradictions. Hence, it is usually impossible to find a function that can accurately classify all training samples. The goal of ML is to develop a model *f* that best matches each training example (**x**^{(i)}, *y*^{(i)}), so that *f* can be used as a predictor for new data or observations **x**^{(new)}. Modeling the inaccuracy as noise, the generic data model for ML can be described as

$$\mathbf{y}=f(\mathbf{X})+\varepsilon$$

where ɛ represents discrepancies such as measurement errors or other inconsistencies.

It yields the prediction for new data **x**^{(new)} as

$$\widehat{y}^{(new)}=f\left(\mathbf{x}^{(new)}\right)$$

During the process of training the model, its parameters (which can be expressed as a parameter vector **θ** ∈ ℝ^{n}) are adjusted. The parameter vector may therefore be estimated by solving a (convex) optimization problem ().

A loss (or cost) function *l*(∙) is used to evaluate the performance of a neural network by calculating the error between the actual value *y*^{(i)} and the predicted value *ŷ*^{(i)}. It is a very important indicator for monitoring the training process. Cross-entropy loss is the most commonly used loss function for multi-class classification problems, expressed as

$$l\left(\widehat{y}^{(i)},y^{(i)}\right)=-\sum_{k=1}^{C}y_{k}^{(i)}\log \widehat{y}_{k}^{(i)}$$

where *C* is the number of classes, and $\widehat{y}_{k}^{(i)}$ and $y_{k}^{(i)}$ are the predicted probability that the *i-th* data point belongs to class *k* and the one-hot encoded ground truth, respectively.

Averaging the loss over the whole training set *S* yields the training loss *J* as

$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}l\left(\widehat{y}^{(i)},y^{(i)}\right) \tag{8}$$

The goal of training the neural network is to minimize the value of this loss function. Therefore, the process of model training is transformed into the process of finding the minimum of the training loss *J*(θ), which is described as

$$\theta^{*}=\underset{\theta}{\arg\min}\ J(\theta)$$

After this calculation process, the best parameters of the function *f* on the training set *S* are obtained, and the output *y* for any given input **x** can then be predicted by the formula $\widehat{y}={\theta}^{T}x$ .
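As a minimal numerical sketch of the loss described above, the snippet below evaluates the cross-entropy for one-hot labels and averages it over a toy training set. The probabilities and labels are invented for illustration, and the small `eps` clamp is an added safeguard against log(0), not part of the definition.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def training_loss(labels, predictions):
    """Average cross-entropy over all samples (the training loss J)."""
    losses = [cross_entropy(t, p) for t, p in zip(labels, predictions)]
    return sum(losses) / len(losses)

# Toy example: two samples, two classes (values are hypothetical)
labels = [[1, 0], [0, 1]]
predictions = [[0.9, 0.1], [0.2, 0.8]]

J = training_loss(labels, predictions)
print(f"training loss J = {J:.4f}")
```

Only the probability assigned to the true class contributes to each term, so confident correct predictions drive the loss towards zero.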

#### C. Deep learning

The performance of most traditional ML algorithms depends on how accurately the features are identified and extracted (). Deep learning is a branch of ML () which offers an efficient method for learning features by applying multiple layers of neural networks that extract different features at each layer. The initial data are provided to the network in its visible input layer, while the different abstract features are generated in the subsequent hidden layers. Hence, hierarchical feature extraction finally yields an abstract representation of the data.

##### 1) Fully connected neural networks

With the advantage of being “structure agnostic”, fully connected networks (as shown in Figure 5) are, in principle, capable of learning any function, which makes them a universal learning architecture; they are regarded as the workhorses of deep learning.

A fully connected neural network consists of a series of fully connected layers (), each of which is a mathematical function that describes its input-output relationship. Considering one training example **x**^{(i)}, for any layer *l* the relationship between its input and output can be described as

$$a_{j}^{[l](i)}=\sigma\left({w_{j}^{[l]}}^{T}a^{[l-1](i)}+b_{j}^{[l]}\right),\quad j=1,\dots,n^{[l]}$$

where $a_{j}^{[l](i)}$ represents the output of the *j-th* neuron $u_{j}^{[l]}$, $a^{[l-1](i)}\in\mathbb{R}^{n^{[l-1]}}$ represents the input vector from the previous layer *l*-1, $w_{j}^{[l]}$ is the weight vector associated with the inputs of neuron $u_{j}^{[l]}$, $b_{j}^{[l]}$ is the bias associated with neuron $u_{j}^{[l]}$, *n*^{[l]} is the number of neurons in layer *l*, and *σ*(∙) is an activation function as explained below.

Consider a standard *L*-layer fully connected neural network with all the training examples **X**. We can denote the input layer as $A^{[0]}\in\mathbb{R}^{n\times m}$ ($A^{[0]}=X$), the outputs of hidden layer *l* as $A^{[l]}\in\mathbb{R}^{n^{[l]}\times m},\ l=1,\dots,L-1$, and the very last, so-called output layer as $A^{[L]}\in\mathbb{R}^{n^{[L]}\times m}$. At each layer, weights are introduced as $W^{[l]}\in\mathbb{R}^{n^{[l]}\times n^{[l-1]}},\ l=1,\dots,L$, together with bias vectors $b^{[l]}\in\mathbb{R}^{n^{[l]}},\ l=1,\dots,L$, which yield the generic calculation model of the network:

$$A^{[l]}=h\left(W^{[l]}A^{[l-1]}+b^{[l]}\right),\ l=1,\dots,L-1;\qquad A^{[L]}=g\left(W^{[L]}A^{[L-1]}+b^{[L]}\right)$$

where *h*(∙) and *g*(∙) are nonlinear activation functions. The commonly used activation functions for hidden layers are ReLU (), the sigmoid function (), and the hyperbolic tangent (). In our study, we use ReLU, which can greatly accelerate the learning speed of the classification model. For the output layer, *g*(∙) always represents the softmax function (), which provides the probability distribution $\widehat{y}_{k}^{(i)}$ of each outcome or label over the *C* classes, defined by the formula

$$\widehat{y}_{k}^{(i)}=\frac{\exp\left(z_{k}^{(i)}\right)}{\sum_{c=1}^{C}\exp\left(z_{c}^{(i)}\right)} \tag{12}$$

where $z_{k}^{(i)}$ (also called logits) represents the score of the *i-th* data point belonging to class *k* as obtained in the output layer. Note that in our classification model the number of neurons in the last layer (layer *L*) is equal to the number of classes (*n*^{[L]} = *C*). The predicted class *Ĉ*^{(i)} (label) for the *i-th* data point is obtained as

$$\widehat{C}^{(i)}=\underset{k}{\arg\max}\ \widehat{y}_{k}^{(i)} \tag{13}$$
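The forward pass through such a network can be sketched in a few lines of plain Python. The layer sizes (2 inputs, 150 hidden units, 2 outputs) follow the architecture described in Section 3.2; the random weights, the helper names, and the example input point are assumptions for illustration only, so the resulting probabilities carry no physical meaning.

```python
import math
import random

def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Softmax over logits; subtracting the maximum avoids overflow."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(weights, biases, inputs):
    """Affine map W a + b for a single input vector."""
    return [sum(w * a for w, a in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, W1, b1, W2, b2):
    """Two-layer network: ReLU hidden layer followed by a softmax output layer."""
    hidden = relu(dense(W1, b1, x))
    return softmax(dense(W2, b2, hidden))

def predict(x, W1, b1, W2, b2):
    """Class label = index of the largest predicted probability."""
    probs = forward(x, W1, b1, W2, b2)
    return max(range(len(probs)), key=probs.__getitem__)

rng = random.Random(0)  # hypothetical, untrained weights
n_in, n_hidden, n_out = 2, 150, 2
W1 = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

n_params = n_hidden * (n_in + 1) + n_out * (n_hidden + 1)
probs = forward([0.3, 0.7], W1, b1, W2, b2)
label = predict([0.3, 0.7], W1, b1, W2, b2)
print(n_params, probs, label)
```

With these sizes the model has 752 trainable parameters, which is small enough to train one such network per bin.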

Unfortunately, the parameter vector **θ** of deep fully connected neural network models (composed of **W**^{[l]} and **b**^{[l]} for each layer *l*) makes the computational complexity grow exponentially with the number of layers, and at the same time slows down the training process and increases the chance of overfitting, so the number of such layers is limited in practice.

##### 2) Gradient descent optimization algorithm

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks (). Learning rate is a parameter of the training process, which determines the step size to reach the local minimum of the objective function *J* (equation (8)). Choosing its proper value can be difficult. A learning rate that is too small will lead to very slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge ().

To overcome this, the Adaptive Moment Estimation method (Adam) () computes adaptive learning rates for each parameter; it can be viewed as adding bias correction and momentum to RMSprop (Root Mean Square Propagation), a similar but simpler algorithm (). While Adam, RMSprop, and Adadelta (), an extension of Adagrad (), are very similar algorithms, Kingma et al. () show that Adam combines their benefits and that it can slightly outperform RMSprop towards the end of optimization as gradients become sparser ().

With Adam as the optimizer, the learning rate does not need to be manually adjusted at different stages; only a default value has to be provided.
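For reference, a single Adam update can be sketched as follows. The hyperparameter defaults are the commonly used ones, and the one-dimensional quadratic objective is a hypothetical toy problem, not the training loss of this study.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy problem (hypothetical): minimize J(theta) = (theta - 3)^2
theta, m, v = [0.0], [0.0], [0.0]
for t in range(1, 5001):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)

print(f"theta after 5000 steps: {theta[0]:.3f}")
```

The per-parameter scaling by the square root of the second-moment estimate is what makes the step size largely independent of the raw gradient magnitude.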

### 3.2 Spectra classification

In this work, we give an example of using experimental data to determine if our multi-crystal powder system has already passed a phase transition during a pressure ramping experiment protocol or not.

Since pressure was not recorded directly with the raw dataset, a training set containing elements from both phases could not be identified automatically; instead, the representative curves had to be selected manually. For this purpose, four representative spectral curves (two from each class) are included in the training set (as shown in Figure 3). To learn the characteristic differences between the two classes, neural network based ML is used. To increase the robustness of the neural network model and its tolerance to small Gaussian measurement errors, 10 simulated spectral curves for each original spectrum were also included in the training set. They were obtained by adding random noise that is small enough to preserve the characteristic features of the spectra. The random noise is generated using the Mersenne Twister () as the core generator. The spectral data used for the training can be seen in Figure 7.
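The augmentation step can be sketched as below. Python's `random` module uses the Mersenne Twister as its core generator; the noise level `sigma`, the seeds, and the four short toy spectra are assumptions chosen only to make the example self-contained.

```python
import random

def augment(spectrum, n_copies=10, sigma=0.01, seed=0):
    """Return n_copies noisy versions of a spectrum (additive Gaussian noise)."""
    rng = random.Random(seed)  # Mersenne Twister based generator
    return [[value + rng.gauss(0.0, sigma) for value in spectrum]
            for _ in range(n_copies)]

# Hypothetical stand-ins for the four manually selected spectra:
# (intensity values, class label), two curves per class
originals = [
    ([0.0, 1.0, 0.2, 0.0], 0),
    ([0.0, 0.9, 0.3, 0.0], 0),
    ([0.0, 0.2, 1.0, 0.1], 1),
    ([0.1, 0.3, 0.9, 0.0], 1),
]

training_set = []
for index, (spectrum, label) in enumerate(originals):
    training_set.append((spectrum, label))
    for noisy in augment(spectrum, seed=index):
        training_set.append((noisy, label))

print(len(training_set))  # 4 originals + 4 * 10 simulated curves
```

Under these assumptions the training set contains 44 labeled curves, 22 per class.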

As described below, with the introduction of a new metric *separability* learnt from the final accuracy of the trained neural network model in each bin, we assign *believability weighting factors* to the bins to describe how much the classification prediction of the ML model in the given bin can be trusted. A weighted average of the individual predictions from all bins yields the final classification prediction for any new spectra.

#### A. Neural Network Structure

We chose neural networks for their ability to learn complex mappings between input and target spaces, which makes them well suited to our task. Neural network models have gained increased popularity recently, since they can express complex function mappings using inputs with very little or no feature engineering (). In this study, we explored a two-layer neural network architecture with 150 hidden units. Each layer accepts the output of the previous layer as its input and returns a transformation of it as its output (). We transform our problem of finding and distinguishing features in a set of 1D curves, where the input is the sequence of spectral intensity values, into a 2D segmentation problem, where the input is every point of the spectra with its two coordinates (scattering angle and azimuthally integrated intensity, as in Figure 2). Hence, our input layer has 2 input neurons (corresponding to the coordinates of the given points in the 2D space) and the hidden layer has 150 hidden neurons (as an abstract feature map) followed by a non-linear ReLU function. Their outputs are fed into the second layer (the output layer) with 2 output neurons (corresponding to the number of classes, i.e. measured before or after the phase transition), followed by a softmax activation function, as in Figure 6. The ReLU activation units are selected here to speed up the model learning and prevent gradient vanishing/exploding ().

The neural network described above can be applied to the general 2D space segmentation problem, as discussed in (). Here, we investigate the application of the same NN architecture to the special case of classifying real experimental spectra.

#### B. Model Training and believability weighting factors

As Figure 7 shows, the spectra in the two training classes partially overlap. Hence, classification (or segmentation) at the overlapping parts is difficult, meaningless, and of low confidence. The more our spectra overlap, the lower the overall accuracy, regardless of how precisely our model performs the segmentation at locations with well-separable features. Because of this, we divide the spectrum into several intervals, called bins, so that we can learn the local separability of the spectra in each bin separately; it is given by the classification accuracy at the end of the training. The principle for selecting the number of bins is to ensure that at least one spectral bin has a high separability, as indicated by the classification accuracy. Figure 7 shows the binning for the case of dividing the spectrum into 18 bins.
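A minimal sketch of the binning step is given below, using the curve length of 4023 points and the 18 bins quoted in this paper. Splitting by point index into nearly equal parts is an assumption; the original implementation may instead bin by scattering angle. The synthetic points are placeholders for (scattering angle, intensity) pairs.

```python
def bin_edges(n_points, n_bins):
    """Index boundaries that split n_points into n_bins nearly equal parts."""
    return [(i * n_points) // n_bins for i in range(n_bins + 1)]

def split_into_bins(points, n_bins):
    """Split a list of (angle, intensity) points into consecutive bins."""
    edges = bin_edges(len(points), n_bins)
    return [points[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

N, B = 4023, 18
# Placeholder curve: hypothetical angles with zero intensity
spectrum = [(0.01 * i, 0.0) for i in range(N)]

bins = split_into_bins(spectrum, B)
sizes = [len(b) for b in bins]
print(len(bins), min(sizes), max(sizes))
```

Since 4023 is not divisible by 18, the bins in this sketch hold either 223 or 224 points; one network is then trained on the points of each bin.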

After splitting the spectra, two-layer neural networks with the identical structure described above are used to perform classification training in each bin separately. The obtained classification accuracy, as a separability indicator, is used for automatically calculating a believability weighting factor for each bin (equation (15)).

The neural network is trained by backpropagation using gradient descent with the Adam update scheme, as described above. For the classification task in each bin, we use cross-entropy as the loss function, and ReLU and softmax as the activation functions. The statistical model is obtained by minimizing the loss function on the training data set, checking whether each point is properly classified. Training runs for 700 epochs, in which the point-wise probabilities ${\widehat{y}}_{k}^{\left(i\right)}$ and then the class id *Ĉ*^{(i)} ∈ {0, 1, …, *C* – 1} are predicted for each point by applying equations (12) and (13). The final classification (or space segmentation) model in each bin is shown in Figure 7 by the two background colors. As the figure inset also highlights, the segmentation is determined by the point distribution of the spectra in each bin, and certain data will inevitably be misclassified.

In this process, the training accuracy of classifying each point in each bin is obtained. The corresponding statistics are shown in Figure 8. The optimized accuracy as a quality index for our neural network model is used as a measure for separability in each bin. High-quality classification is obtained in bins with high separability, whereas the point-wise classification in bins with low separability is misleading.

In this work, we use *A*_{b}, *b* = 1, …, *B*, to represent the training classification accuracy, and thus the spectral separability, in bin *b*, where *B* represents the number of bins. The values of *A*_{b} can vary between 0 and 1. Since 50% classification accuracy corresponds to maximum ambiguity, we consider that an accuracy below 55% corresponds to a confidence weighting factor of 0, and an accuracy of 100% corresponds to 1. To increase the impact of the training accuracy, a square operator is also introduced. Based on this, a *believability weighting factor* *w*_{b} is calculated for each bin (equation (15)).

The result shows (see Figure 8) that the more obvious the feature difference in a bin, the higher the classification accuracy (separability), and thus the larger the weighting factor. At the same time, it can also be seen that the weighting factors corresponding to non-separable regions with accuracy less than 55% are 0.

The final classification label for each spectral curve is calculated by the weighted sum of the point-wise classification results assigned by our neural networks in each bin:

where $\widehat{C}_{k}^{(i)}$ equals 1 if the *i-th* feature or observation *Ĉ*^{(i)} is labeled as *k*, and 0 otherwise. $N_{B}=N/B$ represents the number of data points in each bin, where *N* represents the number of data points in each spectral curve. Based on the classification prediction *L*_{curve} for any measured spectrum, immediate information on the phase of the system, and thus on the state of the experiment, can be provided.
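The weighting and voting scheme can be illustrated with the sketch below. The exact formulas are not reproduced here, so the weight function is only one form consistent with the description (zero below 55% accuracy, one at 100%, squared in between), and normalizing the per-bin votes by the bin size is likewise an assumption. All function names and the toy labels are hypothetical.

```python
def believability_weight(accuracy, floor=0.55):
    """Assumed weight: 0 below `floor`, 1 at accuracy 1.0, squared in between."""
    if accuracy < floor:
        return 0.0
    return ((accuracy - floor) / (1.0 - floor)) ** 2

def classify_curve(point_labels_per_bin, accuracies, n_classes=2):
    """Weighted vote over bins; returns (curve label, per-class scores)."""
    scores = [0.0] * n_classes
    for labels, accuracy in zip(point_labels_per_bin, accuracies):
        if not labels:
            continue
        weight = believability_weight(accuracy)
        for k in range(n_classes):
            # Fraction of points in this bin that were assigned to class k
            scores[k] += weight * labels.count(k) / len(labels)
    label = max(range(n_classes), key=scores.__getitem__)
    return label, scores

# Toy example: point-wise predictions in three bins and their training accuracies
point_labels = [[0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0]]
accuracies = [0.50, 0.95, 0.70]

curve_label, scores = classify_curve(point_labels, accuracies)
print(curve_label, [round(s, 4) for s in scores])
```

In this toy case the first bin is ignored because its accuracy is below the threshold, and the highly separable second bin dominates the vote.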

#### C. Performance metrics

Besides classifying the spectra into two phases, our aim also includes finding the transition point during an experiment. Determining such a boundary manually is difficult and inaccurate. Instead, we are interested in the robustness of the ML-based classification and want to minimize any ambiguity zone during the experiment, where the classification prediction jumps inconsistently between the two phases. Hence, the performance metric of the classification model should show how small such a zone of ambiguity is. From the physics point of view, a proper interpretation would require the phases and the ambiguity zone to be linked to specific pressure ranges. Unfortunately, the available data is incomplete and does not contain such information, so an objective measure for the classification quality is not available. This also highlights the importance of storing all relevant metadata and following FAIR principles in data management.

Instead, we define the classification confidence as described here. Let $N_f$ represent the number of spectral curves in the ambiguous region and $N_t$ the total number of test spectral curves. The classification confidence can then be defined as

$$\text{confidence} = \left(1 - \frac{N_f}{N_t}\right) \times 100\% \qquad (17)$$

If there is a clear division between the two types of spectra, the classification is considered to be 100% confident. In the particular case where no phase change is found and all spectra are assigned to the same class without any boundary between the phases, all spectra are considered to be in the ambiguous region and the classification confidence is 0.
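The metric can be sketched in a few lines, under two assumptions that are not stated explicitly in the text. First, the confidence is taken as $1 - N_f/N_t$, which matches both boundary cases described above. Second, the ambiguous region is taken to span from the first curve whose label differs from the initial phase to the last curve that still carries the initial label. The function name and the toy label sequences are hypothetical.

```python
def classification_confidence(labels):
    """Confidence in percent for curve labels ordered by acquisition."""
    n_total = len(labels)
    if n_total == 0:
        raise ValueError("at least one curve label is required")
    initial = labels[0]
    changed = [i for i, lab in enumerate(labels) if lab != initial]
    if not changed:
        # No phase change found: every curve counts as ambiguous.
        return 0.0
    first_changed = changed[0]
    last_initial = max(i for i, lab in enumerate(labels) if lab == initial)
    # Curves between the first jump and the last fallback are ambiguous.
    n_ambiguous = max(0, last_initial - first_changed + 1)
    return 100.0 * (1.0 - n_ambiguous / n_total)

clean = classification_confidence([0, 0, 0, 1, 1, 1])        # 100.0
noisy = classification_confidence([0, 0, 0, 1, 0, 1, 1, 1])  # 75.0

# Hypothetical sequence of 349 curves with four ambiguous ones.
example = classification_confidence([0] * 170 + [1, 0, 1, 0] + [1] * 175)
print(round(example, 3))  # 98.854
```

The last example is constructed only to show the arithmetic: with 349 curves, an ambiguous zone of four curves gives 98.854%.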

### 3.3 Classification Results

The full dataset used for the evaluation of the neural network learning model consists of 349 spectral curves, each of which has *N* = 4023 data points. Part of the dataset (one out of every 20 diffractograms) is shown in Figure 3 as described above.

According to the performance metric in equation (17), a curve-wise classification confidence can be calculated for the dataset. With 18 bins, the ambiguous zone is small and the final curve-wise classification confidence is 98.854%. For different numbers of bins, Figure 9 shows the respective ambiguous zones and Figure 10 plots the calculated classification confidence. Figure 9 displays the ambiguous zones around the phase change by coloring the spectra according to the classification prediction and continuously offsetting the intensity baselines for better visibility.

It can be clearly seen that the more bins are applied, the higher the classification accuracy and thus the smaller the ambiguous zone. This is achieved by weighting the individual classification predictions calculated in the different bins, which makes the bins containing well-separable spectral features dominant and weakens the impact of misclassifications in bins with low spectral separability and correspondingly low believability weighting factors. The key to achieving a high classification confidence is finding bins with a high separability index. Therefore, the optimal number of bins should be determined by increasing the number of bins until it is ensured that bins with high separability have been found.
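The selection rule in the last sentence can be sketched as a simple search. This is only an illustration: the threshold `min_accuracy`, the function names, and the toy accuracy model are assumptions, and in practice `accuracies_for` would train one network per bin and return the training accuracies $A_b$.

```python
def choose_number_of_bins(accuracies_for, candidates, min_accuracy=0.9):
    """Return the first candidate bin count with a highly separable bin.

    accuracies_for: callable mapping a number of bins to the list of
                    per-bin training accuracies (hypothetical interface).
    """
    for n_bins in candidates:
        if max(accuracies_for(n_bins)) >= min_accuracy:
            return n_bins
    return None  # no candidate produced a sufficiently separable bin

def toy_accuracies(n_bins):
    # Toy stand-in: the best bin becomes more separable as bins get narrower.
    best = min(1.0, 0.5 + 0.03 * n_bins)
    return [0.5] * (n_bins - 1) + [best]

chosen = choose_number_of_bins(toy_accuracies, range(2, 31, 2))
```

With this toy model, the search stops at the first even bin count whose best bin reaches the assumed 90% threshold.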

The relationship between the classification confidence and the number of bins applied is shown in Figure 10. The overall trend is that the ambiguous region shrinks as the number of bins increases. The presence of bins with high separability and correspondingly high believability weighting factors immediately shows that the characteristic features of the different classes have been captured and that high overall performance can be expected.

## 4 Conclusion

In this study, we presented an example of classifying experimental spectra using neural-network-based ML. We highlighted the shortcoming of missing metadata and proposed overcoming it by applying the NeXus glossary and data format standard in XFEL sciences. Data reuse has been presented with a focus on classifying the measured spectra into two classes, recorded either before or after a phase transition. The classification is based on spectral features specific to an experimental phase. Hence, changes in spectral peaks (increasing, decreasing, vanishing, shifting) had to be found. Since the spectra of the two classes overlap in some regions, their separability is low there. We have shown that binning the data and down-weighting the classification predictions from bins where the classification accuracy is low results in a reliable final classification. In each bin, a two-layer neural network was trained for the classification and also for determining the separability of the classes within the bin. A weighted sum of the predicted class labels of all data points of a whole spectrum then yields the final classification of the curve. This can minimize or even eliminate the effects of misclassified data points in overlapping regions.

The results show that our spectral classification model is robust to random noise and can identify peak intensity changes or peak shifts. The classifier can provide immediate feedback on the spectral class and thus on the actual phase of the experiment.

## 5 Perspectives and Recommendations

Although we obtained good results on ML-based spectral classification, the process of selecting and creating the training set is still limited, because the data is not properly annotated and some key metadata (such as the incident beam energy or the pressure value for each spectral curve) is missing. To improve the situation and ensure that data aligns with the FAIR principles, a specific experiment data model should be built to connect the instrumentation and data acquisition entries to physics models. ML could then be set up much more easily and efficiently, because all required metadata would be registered for proper interpretation without the need for personal consultation. Since the diffractograms in our study were measured in the course of pressure ramping, an appropriate NeXus application definition should have been made to define all relevant metadata entries that need to be recorded. Experiments recording data and metadata according to such an experiment data model would then help data interpretation, especially machine readability, and reuse.

In our ML algorithm, the weighting factors are calculated based on the training accuracy in each bin. As a next step, the Neural Network can be extended to form an end-to-end ML structure which can automatically learn these weighting factors and output directly the final classification label.

Another possible development is to handle the spectral data as a set of one-dimensional time series, so that, instead of the space-segmentation approach presented here, the classification could also take intra-spectrum characteristics into account. In this setting, different supervised deep learning architectures can be applied to spectra classification, such as convolutional neural networks (CNNs), residual networks (ResNets), LSTM-based architectures, and attention-based architectures. Unsupervised learning, such as clustering, is also well suited to such classification tasks.