BLOG | Samsung Research

PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation

Introduction

Tables serve as a fundamental means of organizing and presenting structured information in various documents, including scientific papers, financial reports, and web pages. Automatic recognition of table structures, including rows, columns, and cell relationships, is crucial for information extraction, data analysis, and document understanding [1, 2]. In this paper, we introduce an efficient PIXel-precise TABle structure recognition (PIX-TAB) approach with speculative decoding and region-based image segmentation (Fig. 1), inspired by the MTL-TabNet architecture. A key advantage of our approach is that it gives a precise, pixel-level structure using a small and fast neural network capable of on-device execution, while staying flexible: adding support for a new language simply requires replacing the Optical Character Recognition (OCR) model, without any modifications to the core structure recognition model.

Figure 1. The overview of the proposed approach.

Our contributions are summarized as follows:

We propose a compact table representation employing Position-Aware Pixel-Precise Tokens, allowing the model to use this information during table decoding, which is especially beneficial for more accurate recognition of long and complex tables.

We introduce analytic Speculative Decoding, which speeds up sequence generation while keeping the same recognition accuracy, reducing decoding time and improving on-device responsiveness.

We introduce the TEDS_struct100 and TEDS₁₀₀ metrics to overcome the shortcomings of existing TSR accuracy measures.

We propose Region-Based image Segmentation based on a flood-fill technique with 8-connectivity, breadth-first search, and result filtering to enable reliable cell detection for tables with well-defined geometric boundaries.

We address the limitations of existing datasets by proposing a comprehensive approach for generating synthetic data based on Wikipedia tables with enhanced structural diversity and visual variance, producing over one million annotated table images for robust model training.

Overview

The proposed PIX-TAB approach consists of four parts:

an encoder-decoder model that predicts position-aware pixel-precise tokens and OTSL tokens;

a region-based image segmentation (RBIS) module that predicts OTSL tokens and bounding boxes;

an external OCR model that predicts text;

an aggregation module that selects and combines results from all other modules using a hybrid selection strategy.

Position-Aware Pixel-Precise Tokens

To enable deterministic reconstruction of table cells, we extended the structured OTSL representation [4] by adding explicit row- and column-position tokens. Thus, for a normalized table image of size X×Y, the position-aware pixel-precise (PAPP) tokens are constructed as follows:

Row start tokens <rYYY>, where YYY ∈ [0, Y), indicate the vertical pixel coordinate of each horizontal table line;

Column boundary tokens <cXXX>, where XXX ∈ [0, X), indicate the horizontal pixel coordinate of each vertical table line.

These PAPP tokens are mixed with four structural OTSL tokens: "C" (cell), "L" (left-looking), "U" (up-looking), and "X" (cross).

We did not employ the "NL" token from the original OTSL because the start of a new row is already indicated by the <rYYY> token. In addition, we terminate the sequence with the </table> token to mark the end of the table. As can be seen from Fig. 2, the proposed table representation is considerably more compact than the equivalent HTML markup. It is only marginally larger than a pure OTSL representation due to the inclusion of the first row tokens.
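To make the "deterministic reconstruction" concrete, the sketch below turns a PAPP-OTSL token sequence into cell bounding boxes. The exact token order is only shown in Fig. 2, so the layout used here is an assumption: `<cXXX>` tokens appear in the first row, and one extra `<cXXX>` and one extra `<rYYY>` close the right and bottom edges. The function names `parse_papp` and `cell_boxes` are hypothetical.

```python
import re

OTSL = {"C", "L", "U", "X"}

def parse_papp(tokens):
    """Split a token sequence into row lines, column lines and an OTSL grid."""
    rows_y, cols_x, grid = [], [], []
    for tok in tokens:
        if tok == "</table>":
            break
        m = re.fullmatch(r"<([rc])(\d+)>", tok)
        if m and m.group(1) == "r":
            rows_y.append(int(m.group(2)))
            grid.append([])                 # every <r...> opens a new row
        elif m:
            cols_x.append(int(m.group(2)))
        elif tok in OTSL and grid:
            grid[-1].append(tok)
        else:
            raise ValueError(f"unexpected token: {tok}")
    # Assumption: a trailing <r...> with no cells marks the bottom edge only.
    return rows_y, cols_x, [row for row in grid if row]

def cell_boxes(tokens):
    """Return (x0, y0, x1, y1) for every cell that starts with a 'C' token."""
    rows_y, cols_x, grid = parse_papp(tokens)
    boxes = []
    for i, row in enumerate(grid):
        for j, tok in enumerate(row):
            if tok != "C":
                continue
            colspan = 1                     # 'L' cells extend the span to the right
            while j + colspan < len(row) and row[j + colspan] == "L":
                colspan += 1
            rowspan = 1                     # 'U' cells extend the span downwards
            while i + rowspan < len(grid) and grid[i + rowspan][j] == "U":
                rowspan += 1
            boxes.append((cols_x[j], rows_y[i],
                          cols_x[j + colspan], rows_y[i + rowspan]))
    return boxes

# A 2x2 table whose header cell spans both columns (illustrative values).
example = ["<r000>", "<c000>", "C", "<c060>", "L", "<c120>",
           "<r020>", "C", "C", "<r040>", "</table>"]
print(cell_boxes(example))
# [(0, 0, 120, 20), (0, 20, 60, 40), (60, 20, 120, 40)]
```

Because the pixel coordinates are part of the sequence itself, no separate bounding-box regression is needed at inference time in this sketch.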

Figure 2. Comparison of table representation in (a) our proposed PAPP and OTSL tokens; (b) HTML tokens for the same table.

Model Architecture

The detailed architecture of the encoder-decoder model is presented in Fig. 3.

The encoder-decoder model consists of four main components:

An enhanced ResNet backbone encoder.

A Shared Decoder that provides features for the structure decoder (StructDecoder) and the bounding box decoder (BboxDecoder).

A StructDecoder: a transformer decoder that predicts PAPP and OTSL tokens.

A BboxDecoder: a lightweight auxiliary bounding-box head used only during training.

Figure 3. Encoder-decoder model architecture

Speculative Decoding

Due to the specifics of TSR and the use of PAPP-OTSL tokens instead of bounding boxes during inference, we can generate hypotheses for speculative decoding [5] with an analytical algorithm instead of using another decoding model. The PAPP-OTSL sequences are highly regular across rows, and we exploit this regularity to suggest a sequence of future tokens and reduce the number of decoder steps. Our decoder fills the table row by row. After the first row there are no <cXXX> tokens, and each subsequent row starts with <rYYY>, followed by a sequence of OTSL tokens. Together, this makes it possible to suggest a sequence of future tokens with simple rules, without running any extra neural network. Fewer decoder steps mean lower latency.

The speculation itself is pure token-level manipulation with a computational cost of O(K×N_cols) per trigger and requires no extra model calls. In practice, this cost is negligible compared with a single decoder step, yet it often removes many decoder steps for regular tables.
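The idea can be illustrated with a minimal sketch. The post does not list the actual drafting rules, so the rule below (copy the previous row's OTSL tokens as the draft for the row that has just been opened) is an assumption, as are the names `draft_row`, `speculative_decode`, and `verify`. Here `verify(prefix, draft)` stands in for a single decoder pass that returns the model's greedy prediction at every draft position plus one extra token.

```python
OTSL = {"C", "L", "U", "X"}

def draft_row(tokens):
    """Propose tokens for the row just opened by the last <r...> token."""
    if not tokens or not tokens[-1].startswith("<r"):
        return []
    row_starts = [i for i, t in enumerate(tokens) if t.startswith("<r")]
    if len(row_starts) < 2:
        return []                       # nothing to copy in the first row
    prev_row = tokens[row_starts[-2] + 1 : row_starts[-1]]
    return [t for t in prev_row if t in OTSL]

def speculative_decode(verify, max_len=256):
    """Greedy decoding where accepted draft tokens cost no extra decoder call."""
    tokens, calls = [], 0
    while len(tokens) < max_len and (not tokens or tokens[-1] != "</table>"):
        draft = draft_row(tokens)
        preds = verify(tokens, draft)   # one decoder pass for the whole draft
        calls += 1
        n_ok = 0
        while n_ok < len(draft) and preds[n_ok] == draft[n_ok]:
            n_ok += 1
        # Keep the verified prefix plus the model's own next token.
        tokens += draft[:n_ok] + [preds[n_ok]]
    return tokens, calls

def make_toy_verifier(target):
    """Toy stand-in for the decoder: it always 'predicts' a fixed sequence."""
    def verify(prefix, draft):
        start = len(prefix)
        return target[start : start + len(draft) + 1]
    return verify

target = ["<r000>", "<c000>", "C", "<c040>", "C", "<c080>", "C", "<c120>",
          "<r020>", "C", "C", "C",
          "<r040>", "C", "C", "C",
          "<r060>", "</table>"]
decoded, calls = speculative_decode(make_toy_verifier(target))
print(decoded == target, calls, len(target))  # True 12 18
```

In this toy run the output is identical to plain greedy decoding, but the regular rows are accepted in one pass each, so 18 tokens are produced with 12 decoder calls.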

Region-Based Image Segmentation

While the encoder-decoder model (EDM) serves as the primary method for TSR in our system, it often struggles with large tables that have complex layouts, a scenario that is very common in real enterprise documents. To address this limitation, we introduced a region-based image segmentation (RBIS) module, which handles such tables more effectively. Our approach employs the EDM as the primary method and runs the RBIS in parallel for tables that have complete borders. Both methods output detected cells in HTML format; the final result is chosen by comparing the two outputs.

The algorithm exhibits time complexity of O(n×m) where n and m are the image dimensions, as each pixel is visited exactly once. Space complexity is also O(n×m) due to the visited matrix and worst-case breadth-first search queue.
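A minimal sketch of such a flood fill is given below. It assumes a binarized image in which 1 marks cell-interior pixels and 0 marks border lines, and it uses a simple minimum-area threshold as the result filter. Both are assumptions, since the post does not specify the preprocessing or the filtering rules, and `find_regions` is a hypothetical name.

```python
from collections import deque

def find_regions(mask, min_area=1):
    """Return (x0, y0, x1, y1) boxes of 8-connected regions of 1-pixels."""
    h, w = len(mask), len(mask[0])
    visited = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or visited[sy][sx]:
                continue
            # Breadth-first search from an unvisited seed pixel.
            visited[sy][sx] = True
            queue = deque([(sy, sx)])
            x0 = x1 = sx
            y0 = y1 = sy
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for dy in (-1, 0, 1):          # 8-connectivity: all neighbours,
                    for dx in (-1, 0, 1):      # including the diagonals
                        ny, nx = y + dy, x + dx
                        if ((dy or dx) and 0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not visited[ny][nx]):
                            visited[ny][nx] = True
                            queue.append((ny, nx))
            if area >= min_area:               # assumed filter: drop tiny regions
                boxes.append((x0, y0, x1, y1))
    return boxes

mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(find_regions(mask))  # [(1, 1, 2, 2), (4, 1, 5, 2)]
```

Each pixel is marked as visited before it is enqueued, so it enters the queue at most once, which is what gives the O(n×m) time and space bounds stated above.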

Experiments and Results

Metrics

The table recognition performance of a method on a test set is defined as the mean of the TEDS scores between the recognition result and ground truth of each sample. While TEDS_struct provides a reliable measure of structural similarity, its high average score can mask the fact that many individual tables still contain one or more structural errors. To address this shortcoming, we introduce an additional metric, TEDS_struct100, that reports the proportion of tables in the test dataset whose entire structure is recognized perfectly (100% recognition of the table structure). TEDS₁₀₀ is calculated in the same way, except that OCR accuracy is also taken into account.
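The difference between the mean score and the new metric can be sketched as follows, assuming per-table TEDS (or TEDS_struct) scores in [0, 1] have already been computed. The function name `teds_100` and the tolerance argument are illustrative choices, not part of the original definition.

```python
def teds_100(scores, tol=1e-9):
    """Fraction of tables whose per-table score is a perfect 1.0."""
    if not scores:
        raise ValueError("scores must not be empty")
    perfect = sum(1 for s in scores if s >= 1.0 - tol)
    return perfect / len(scores)

# Illustrative per-table scores: the mean looks high, yet half the tables
# still contain at least one error.
scores = [1.0, 0.98, 1.0, 0.95]
print(sum(scores) / len(scores))  # about 0.9825
print(teds_100(scores))           # 0.5
```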

Experimental Results

The proposed approach was evaluated on two benchmark test datasets, FinTabNet and PubTabNet. We also conducted experiments with the addition of the SynthTabNet dataset to compare the results with our proposed data generation method (referred to as Synth). As Tab. 1 shows, our approach yields performance improvements, confirming that any synthetic data enrichment effectively mitigates the inherent limitations of the original dataset, and the proposed metric TEDS₁₀₀ demonstrates its consistency and relevance. As part of the experiments, we also compared the full model with the version optimized for mobile devices. The comparison was performed on a Galaxy Z Fold 5 with a Qualcomm Snapdragon 8 Gen 2 chipset running the Android operating system. The results show that the optimized version of the model, despite a slight decrease in accuracy, operates significantly faster than the full model. We also compared PIX-TAB with NCGM [4]: the PIX-TAB model shows improved accuracy and recognition time (Tab. 2).

Table 1. PIX-TAB evaluation results.

Table 2. PIX-TAB performance evaluation on mobile device (PubTabNet).

Conclusions

In conclusion, this paper introduces an efficient PIX-TAB approach that combines position-aware pixel-precise tokens, an encoder-decoder model, speculative decoding, region-based image segmentation, and a hybrid selection strategy to achieve accurate and fast TSR. Our method stands out for its ability to deliver precise, pixel-level structure using a compact and fast neural network suitable for on-device deployment, while remaining language-agnostic: support for new languages is achieved by simply swapping the OCR model. Experimental results validate each component. The research strongly confirms the efficiency of the region-based image segmentation approach. Specifically, for the MarketingStyle part of SynthTabNet, TEDS_struct100 increases from 56.14 % to 57.59 %, and TEDS₁₀₀ rises significantly from 35.08 % to 45.61 %. Incorporating synthetically generated tables into the training data further enhances performance, increasing TEDS_struct from 95.2 % to 95.5 % and TEDS from 89.3 % to 89.6 % for PubTabNet. Furthermore, the model optimized for mobile devices demonstrated remarkable speed gains, achieving 3.66x and 3.01x faster performance for FinTabNet and PubTabNet, respectively, with only a marginal loss in accuracy.

Link to the paper

References

[1] Nam Ly and Atsuhiro Takasu. An end-to-end multi-task learning model for image-based table recognition. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pages 626–634. SCITEPRESS - Science and Technology Publications, 2023.
[2] Lei Hu and Shuangping Huang. Enhancing table structure recognition via bounding box guidance. In Pattern Recognition, pages 209–225, Cham, 2025. Springer Nature Switzerland.
[3] Weihong Lin, Zheng Sun, Chixiang Ma, Mingze Li, Jiawei Wang, Lei Sun, and Qiang Huo. TSRFormer: Table structure recognition with transformers. In Proceedings of the 30th ACM International Conference on Multimedia, pages 6473–6482, New York, NY, USA, 2022. Association for Computing Machinery.
[4] Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. In Document Analysis and Recognition - ICDAR 2023, pages 37–50, 2023.
[5] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, 2023. PMLR.

1. Introduction

Speech (utterance) restoration is the task of recreating high-fidelity speech from imperfect recordings degraded by noise, reverberation, and other distortions. Such problems have been addressed by classical and neural signal processing methods [21], but their reliance on fixed statistical assumptions limits generalization to varied acoustic conditions.

In recent years, generative diffusion models have been shown to be remarkably effective in this domain, demonstrating leading performance on various benchmarks [9]. By modelling the distribution of clean speech conditioned on corrupted inputs, they produce natural, intelligible reconstructions. However, while effective, these models typically operate on high-dimensional sound representations and require lengthy iterative sampling, resulting in high computational costs that hinder real-time or edge deployments [10]. This challenge is especially acute for high-definition 48kHz audio, where spectrograms are considerably larger and each denoising step becomes significantly more expensive.

In this blog post, we present LAFUFU — a novel approach to the utterance restoration problem leveraging latent-space acoustic representations. Rather than performing the diffusion process on raw audio spectrograms, LAFUFU operates on compact features extracted by a bespoke autoencoder. Utilizing those dedicated latent representations enables significant inference speedups without sacrificing output quality — a critical advantage for high-definition audio processing. We also show that, given equivalent time constraints, LAFUFU is capable of producing higher-quality restored utterances than the classical non-latent alternatives, as evidenced by its competitive performance on the EARS-WHAM and EARS-Reverb 48kHz frontier benchmarks.

The remainder of this post is structured as follows. We first provide the necessary background on score-based generative models and latent space diffusion. We then describe the LAFUFU methodology in detail, followed by a presentation of our experimental results. We discuss the implications of our findings, compare them with prior work, and conclude with a summary and outlook on future directions.

Figure 1. Results overview.

2. Background

Problem formulation

Let us define the distorted recording as the sum $Y=\mathcal{A}(X)+\mathcal{N}$, where $X$ denotes the original clean speech, while $\mathcal{A}$ and $\mathcal{N}$ represent degradations caused by the external environment. $\mathcal{A}$ could take the form of a convolution operator $X * H$, where $H$ is, e.g., a room impulse response. The additive term $\mathcal{N}$ is typically interpreted as background noise. The aim of the utterance restoration task is to recreate $X' \approx X$ from said noisy $Y$. For the purpose of this work, all variables are represented as complex tensors storing the short-time Fourier transform (STFT) coefficients [2].
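As a toy illustration of this degradation model, the sketch below builds a distorted signal from a clean one. It works on real-valued time-domain samples for simplicity, whereas the post works with complex STFT coefficients. The function name `degrade` and the example values are assumptions.

```python
import numpy as np

def degrade(x, h, noise):
    """Toy version of Y = A(X) + N with A(X) = X * H (convolution)."""
    x = np.asarray(x, dtype=float)
    reverberant = np.convolve(x, h)[: len(x)]   # keep the original length
    return reverberant + np.asarray(noise, dtype=float)

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0])        # illustrative impulse response: one-sample delay
print(degrade(x, h, np.zeros(3)))  # [0. 1. 2.]
```

With an identity impulse response `h = [1.0]` and zero noise, the function returns the clean signal unchanged, which corresponds to the case where no restoration is needed.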

Score-based generative models

The restoration task can be effectively recast as a conditional generation problem, enabling adaptation of existing generative frameworks [4]. Given their strong benchmark performance, we focus on approaches employing mean-reverting stochastic differential equations (SDEs) [5]. For our baseline analysis, we selected SGMSE+ [11], a proven score-based model using noise-conditional score networks, as it represents a well-established solution in this methodological family.

Figure 2. Denoising architectures.

SGMSE+ sees the result synthesis mechanism as the inverse complement of a certain diffusion process, defined by the following forward SDE:

$ dX_t=f(X_t,Y)dt+g(t)dw $

where $w$ is a standard Wiener process [7], $f$ is a drift function, $g$ is a diffusion coefficient, $t ∈ [0, T]$, and $X_t$ denotes the current state of the working variable (with $X_0= X$). In practice, this forward procedure gradually transforms the initial clean speech sample $X$ into its distorted counterpart, while simultaneously perturbing it with Gaussian noise.
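The forward process can be simulated with a basic Euler-Maruyama discretisation, sketched below on real-valued arrays. The mean-reverting drift $γ(Y - X_t)$ used in the example is an assumption made for illustration, since the post does not spell out $f$ or $g$, and `forward_sde` is a hypothetical name.

```python
import numpy as np

def forward_sde(x0, y, drift, diffusion, T=1.0, n_steps=100, rng=None):
    """Euler-Maruyama simulation of dX_t = f(X_t, Y) dt + g(t) dw."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        z = rng.standard_normal(x.shape)        # Wiener increment / sqrt(dt)
        x = x + drift(x, y) * dt + diffusion(t) * np.sqrt(dt) * z
    return x

# Assumed mean-reverting drift with an illustrative stiffness gamma = 5.
gamma = 5.0
x_clean = np.array([0.0])
y_noisy = np.array([1.0])
x_T = forward_sde(x_clean, y_noisy,
                  drift=lambda x, y: gamma * (y - x),
                  diffusion=lambda t: 0.0,      # noise switched off for clarity
                  n_steps=1000)
print(x_T)  # close to y_noisy
```

With the noise term switched off, the state decays exponentially from the clean sample towards the distorted one, which is the "mean-reverting" behaviour described above. A non-zero `diffusion` adds the Gaussian perturbation on top.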

This process can be run backwards in time (therefore recreating the original audio) by utilizing the associated reverse SDE [1] (where $\bar{w}$ is a reverse-time Wiener process):

$ dX_t=[-f(X_t,Y)+g(t)^2 \nabla_{X_t} \log p_t(X_t|Y)]dt+g(t)d\bar{w} $

The term $\nabla_{X_t} \log p_t(X_t|Y)$, known as the score (with $p_t$ denoting the conditional probability density), cannot be calculated without prior knowledge of the target $X$. Fortunately, it can be replaced by a learnable parametrised approximation $s_\theta(X_t,Y,t)$ (e.g., in the form of a multi-resolution deep U-Net). Thus, the restoration workflow boils down to: initialising the state at $t=T$ as $X_T=Y+N(\mu,\sigma^2)$ (Gaussian noise with mean $\mu$ and variance $\sigma^2$), dividing the $[0, T]$ interval into $N$ discrete steps (not to be confused with the noise term $\mathcal{N}$ above), employing a suitable numerical solver, and iterating back through the SDE. As a consequence, the operational core of SGMSE+ consists mainly of the looped denoising steps, which in turn rely heavily on repeated calls to the neural score model $s_\theta$.
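The looped denoising procedure can be sketched as a plain Euler-Maruyama solver for the reverse SDE. This is a simplified stand-in rather than the predictor-corrector sampler used in the experiments. It treats `dt` as a positive step taken backwards in time, which is one reading of the sign convention in the equation above, and the callable `score` is a placeholder for the trained network $s_\theta$.

```python
import numpy as np

def reverse_sde(x_T, y, score, drift, diffusion, T=1.0, n_steps=30, rng=None):
    """Integrate the reverse SDE from t = T down to t = 0."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps                      # positive step, taken backwards in time
    x = np.array(x_T, dtype=float)
    for k in range(n_steps):
        t = T - k * dt
        g = diffusion(t)
        z = rng.standard_normal(x.shape)
        # One call to the score model per step dominates the runtime.
        x = x + (-drift(x, y) + g**2 * score(x, y, t)) * dt + g * np.sqrt(dt) * z
    return x

# Toy usage on real-valued arrays with a hand-written score of N(y, I).
rng = np.random.default_rng(0)
y = np.zeros(4)
x_T = y + rng.standard_normal(4)          # initial state: Y plus Gaussian noise
x_0 = reverse_sde(x_T, y,
                  score=lambda x, y, t: -(x - y),
                  drift=lambda x, y: np.zeros_like(x),
                  diffusion=lambda t: 0.5,
                  rng=rng)
print(x_0.shape)  # (4,)
```

The loop makes the cost structure visible: the number of score-model evaluations grows linearly with `n_steps`, and each evaluation operates on a tensor the size of the working representation.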

Latent space diffusion

The main disadvantage of previous mean-reverting SDEs is their iterative, multi-stage enhancement process, which requires significant computational resources. This limitation makes them impractical for real-time applications or resource-constrained environments. A common solution involves transferring the diffusion process from high-dimensional input space to a compact latent space using pretrained variational autoencoders (VAEs) [12]. However, while suitable VAEs exist for general tasks like image generation, they are often unavailable for specialized domains with scarce data. To address this, recent work in image restoration introduced Refusion [6], a simplified, task-specific autoencoder tailored for enhancement needs.

3. Methodology

LAFUFU is a unified lightweight technique that combines the expressive strength of the SGMSE+ model with the efficiency gains provided by the latent-space diffusion paradigm. Rather than running diffusion on full spectrograms, it operates on compact features produced by a bespoke autoencoder — designed and trained from scratch specifically for speech restoration.

Architecture

Our method adapts the Refusion autoencoder (AE) for STFT-based speech processing by treating time and frequency as spatial dimensions. To handle complex numbers, we encode real and imaginary components as separate image channels. Given the sparsity of STFT spectrograms and the varied scales of voice-related features, we replace the typical L1 loss with a multi-resolution STFT loss (MRSTFT) [13] for superior perceptual reconstruction quality. The AE architecture is simplified by using a U-Net with only two down/upsampling blocks — a reduction from the base Refusion's three — due to spectrograms having significantly lower resolution than high-definition images.
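The channel encoding of complex spectrograms can be sketched in a few lines. The channel-first layout `(2, frequency, time)` and the function names are assumptions made for illustration.

```python
import numpy as np

def complex_to_channels(stft):
    """Stack real and imaginary parts as two image-like channels."""
    return np.stack([stft.real, stft.imag], axis=0)

def channels_to_complex(x):
    """Inverse mapping: rebuild the complex spectrogram from two channels."""
    return x[0] + 1j * x[1]

spec = np.array([[1 + 2j, 3 - 1j],
                 [0 + 0j, -2 + 4j]])      # toy (frequency, time) spectrogram
channels = complex_to_channels(spec)
print(channels.shape)                      # (2, 2, 2)
print(np.allclose(channels_to_complex(channels), spec))  # True
```

The mapping is lossless, so the autoencoder sees an ordinary real-valued two-channel image while the phase information is preserved.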

Figure 3.

We retain the original Reg-Loss mechanism, which penalizes embeddings that diverge significantly from the input's statistical properties, as it effectively prevents fragmentation of the latent space into discontinuous hash-like encodings. Formally, Reg-Loss is implemented as:

$ \mathrm{RegLoss}(Z_Y,Y)=|\mu_{Z_Y}-\mu_Y|+|\sigma_{Z_Y}-\frac{1}{2}\sigma_Y| $

where $Z_Y$ is the latent embedding of a distorted sample, while μ and σ denote the mean and standard deviation of the given tensor elements.
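The formula translates directly into code. The sketch below assumes real-valued tensors (for example, after splitting real and imaginary parts into channels) and the population standard deviation, neither of which is stated in the post.

```python
import numpy as np

def reg_loss(z_y, y):
    """|mean(Z_Y) - mean(Y)| + |std(Z_Y) - 0.5 * std(Y)| over all elements."""
    z_y = np.asarray(z_y, dtype=float)
    y = np.asarray(y, dtype=float)
    return abs(z_y.mean() - y.mean()) + abs(z_y.std() - 0.5 * y.std())

y = np.array([-1.0, 1.0])
print(reg_loss(0.5 * y, y))  # 0.0: same mean, half the standard deviation
print(reg_loss(y, y))        # 0.5: the spread is twice the target value
```

The loss is zero when the latent tensor has the same mean as the input and half its standard deviation, so it constrains only two summary statistics of the embedding and not its individual values.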

We follow Luo et al. [6] in employing their latent-replacement approach, where the decoder constructs the output using multi-level features from the distorted input (always available in restoration tasks). This allows the latent tensor to focus solely on encoding the necessary modifications rather than the complete target signal, avoiding the challenge of representing high-entropy components and resulting in a more efficient and robust AE architecture.

For the generative diffusion core, we closely follow the standard SGMSE+ architecture to ensure performance gains stem specifically from our latent-centric approach rather than score model modifications. We only remove its first and last layers, as their raw feature preprocessing is now handled by the autoencoder.

Experimental setup

Computational inefficiencies of diffusion methods become more evident when dealing with high-resolution audio, due to noticeably larger input resolutions. Thus, we decided to focus on audio recordings sampled at 48 kHz and utilise the EARS-WHAM and EARS-Reverb benchmark datasets [14] as the chief diagnostic tools.

For each sub-benchmark, we trained a matching pair consisting of an AE and a score model. Our primary objective was to evaluate whether latent domain diffusion could achieve performance parity with the original SGMSE-class models. To systematically assess this, we implemented latent score models with varying channel configurations (128×, 192×, and 256× per-block channel multiplier), naming them respectively LAFUFU₁₂₈, LAFUFU₁₉₂, and LAFUFU₂₅₆. These models were benchmarked against publicly available SGMSE+ checkpoints, with the smallest one used as a reference point for the ablation study. Additionally, to investigate the effect of model size reduction on output quality, we trained scaled-down SGMSE+ with 64× and 96× channel counts.

Our input preprocessing pipeline and score model training recipe were consistent with prior work [14]. The AE optimization employed MRSTFT loss weight of 1.0 combined with a 0.1-weighted Reg-Loss term. MRSTFT covered eight window lengths (32, 64, 128, 256, 512, 1024, 1534, and 2048 samples), each paired with a ¼ length hop size.
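A rough NumPy sketch of this objective follows. It uses the spectral-convergence plus log-magnitude form from Parallel WaveGAN [13], computed on time-domain signals with a Hann window, and rounds each hop down to an integer. These details, together with the function names, are assumptions, since the post only lists the window lengths, hop ratio, and loss weights.

```python
import numpy as np

WINDOWS = (32, 64, 128, 256, 512, 1024, 1534, 2048)  # lengths listed in the post

def stft_mag(x, win_len, hop):
    """Magnitude spectrogram via framed, Hann-windowed real FFTs."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - win_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(win_len)[None, :]
    frames = x[idx] * np.hanning(win_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def mrstft_loss(pred, target, windows=WINDOWS, eps=1e-7):
    """Average of spectral-convergence + log-magnitude terms over resolutions."""
    total = 0.0
    for win_len in windows:
        hop = max(1, win_len // 4)          # quarter-length hop, rounded down
        p = stft_mag(pred, win_len, hop)
        q = stft_mag(target, win_len, hop)
        spectral_convergence = np.linalg.norm(q - p) / (np.linalg.norm(q) + eps)
        log_magnitude = np.mean(np.abs(np.log(q + eps) - np.log(p + eps)))
        total += spectral_convergence + log_magnitude
    return total / len(windows)

def ae_objective(mrstft, reg, w_mrstft=1.0, w_reg=0.1):
    """Weighted sum with the loss weights quoted above."""
    return w_mrstft * mrstft + w_reg * reg

rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)           # must be at least 2048 samples long
noisy = clean + 0.1 * rng.standard_normal(4096)
print(mrstft_loss(clean, clean))            # 0.0
print(mrstft_loss(noisy, clean) > 0.0)      # True
```

Combining short and long windows lets the loss penalise both fine temporal detail and fine frequency detail, which a single-resolution spectrogram cannot do at once.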

To rigorously evaluate model performance, we employed SI-SDR, PESQ, ESTOI, and DNSMOS for quality assessment alongside real-time factor (RTF) as a computational efficiency measure. Inference was performed utilising the predictor-corrector setup inherited from prior work [14]. To mitigate random initialization effects, we performed three independent training runs for each experimental condition on both EARS-WHAM and EARS-Reverb datasets. Results present mean values, mean standard deviations, and metric-wise standard deviations across all repetitions. All discussed procedures were conducted on a single NVIDIA A100 GPU.
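For reference, the real-time factor is the ratio of processing time to audio duration, so values below 1 mean faster-than-real-time processing. A minimal sketch with illustrative numbers (not measurements from the experiments) is:

```python
def real_time_factor(processing_seconds, n_samples, sample_rate=48_000):
    """RTF = time spent processing / duration of the processed audio."""
    audio_seconds = n_samples / sample_rate
    return processing_seconds / audio_seconds

# Illustrative: 2 s of 48 kHz audio restored in 0.5 s.
print(real_time_factor(0.5, 96_000))  # 0.25
```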

4. Results

EARS-WHAM benchmark

The table and plots below present the results on the EARS-WHAM benchmark (speech denoising task):

Table 1. EARS-WHAM benchmark results.

Figure 4. Relation between speech restoration quality and inference speed (EARS-WHAM benchmark).

EARS-Reverb benchmark

The table and plots below present the results on the EARS-Reverb benchmark (speech dereverberation task):

Table 2. EARS-Reverb benchmark results.

Figure 5. Relation between speech restoration quality and inference speed (EARS-Reverb benchmark).

5. Discussion

The experimental results confirm that performing iterative denoising in the condensed latent space leads to multifold improvements in inference speed, a result of particular significance for high-definition 48 kHz audio, where the computational burden of operating on full spectrograms is most acute. The enhanced performance enables scaling up score model sizes beyond previous architectural limits, yielding better output quality at lower real-time factor (RTF) targets. This is particularly evident on the EARS-Reverb benchmark, where LAFUFU not only surpasses its SGMSE+ foundation but also achieves evaluation scores comparable to the state of the art.

These results highlight representation learning as a key enabler for unlocking generative diffusion potential in audio applications, suggesting further progress is achievable via this research avenue.

Ablation study

We conducted an ablation study on the EARS-Reverb benchmark to assess the contribution of individual architectural components:

Table 3. Ablation study results (EARS-Reverb benchmark).

Removal of the encoder-decoder hidden connections resulted in a marginal decrease across all evaluation metrics, but offered no tangible gains in inference RTF. Thus, while not critical, their influence was deemed beneficial overall.

Attempts at suspending the Reg-Loss component revealed its crucial role in the AE framework, as all models trained without it exhibited substantial performance degradation. Our empirical evidence suggests that preserving the statistical properties of the original space in its latent representation is fundamental for maintaining robustness of the dependent score models.

Comparison with prior work

Score-based generative diffusion has already been applied to the complex STFT domain [16] and proved effective for speech enhancement. Those successes inspired a plethora of follow-up studies [4], with contemporary ones exploring the Schrödinger bridge formulation [9] or hybridising with an adversarial network [3]. Latent diffusion techniques, initially introduced for HD image synthesis [12], found adoption in multiple sound-adjacent scenarios, such as text-to-audio generation [8] or editing [17]. In the context of utterance restoration, latent space embeddings started gaining traction in the past year, seeing use as an auxiliary mechanism in multi-stage enhancement pipelines [18], a part of transformer-based solutions [19], or an enabler for a dual-context conditional diffusion model [20]. However, none of those studies focused on the latency tradeoffs critical for real-time use cases. LAFUFU fills this gap by demonstrating that a bespoke, task-specific latent space can achieve competitive quality with multifold improvements in inference speed.

Limitations and future work

The primary drawback of this approach stems from its dual-model architecture, which increases memory demands and parameter complexity. However, we argue that LAFUFU's advantages outweigh these limitations, establishing latent-driven methods as a viable direction for speech enhancement research.

Link to the paper

ICASSP 2026 paper: LAFUFU: Latent Acoustic Features For Ultra-Fast Utterance Restoration (IEEE Xplore)

Demo page: https://samsunglabs.github.io/LAFUFU/

References

1. Anderson, B. D. O. (1982). Reverse-time diffusion equation models. *Stochastic Processes and their Applications*, 12(3), 313–326.
2. Fu, S.-W., Hu, T.-Y., Tsao, Y., & Lu, X. (2017). Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. *2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)*, 1–6.
3. Han, S., Lee, S., Lee, J., & Lee, K. (2025). Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement. *Interspeech 2025*, 2380–2384.
4. Lemercier, J.-M., Richter, J., Welker, S., Moliner, E., Välimäki, V., & Gerkmann, T. (2025). Diffusion models for audio restoration: A review. *IEEE Signal Processing Magazine*, 41(6), 72–84.
5. Luo, Z., Gustafsson, F. K., Zhao, Z., Sjölund, J., & Schön, T. B. (2023a). Image restoration with mean-reverting stochastic differential equations. arXiv:2301.11699.
6. Luo, Z., Gustafsson, F. K., Zhao, Z., Sjölund, J., & Schön, T. B. (2023b). Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 1680–1691.
7. Malliaris, A. G. (1990). Wiener process. In *Time Series and Statistics* (pp. 316–318). Springer.
8. Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., & Plumbley, M. D. (2023). AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. arXiv:2301.12503.
9. Nasretdinov, R., Korostik, R., & Jukić, A. (2025). Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement. *ICASSP 2025*, 1–5.
10. Richter, J., de Oliveira, D., & Gerkmann, T. (2025). Investigating Training Objectives for Generative Speech Enhancement. *ICASSP 2025*.
11. Richter, J., Welker, S., Lemercier, J.-M., Lay, B., & Gerkmann, T. (2023). Speech enhancement and dereverberation with diffusion-based generative models. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 31, 2351–2364.
12. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 10684–10695.
13. Yamamoto, R., Song, E., & Kim, J.-M. (2020). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. *ICASSP 2020*, 6199–6203.
14. Richter, J., Wu, Y.-C., Krenn, S., Welker, S., Lay, B., Watanabe, S., Richard, A., & Gerkmann, T. (2024). EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation. arXiv:2406.06185.
15. Richter, J., & Gerkmann, T. (2024). Diffusion-based speech enhancement: Demonstration of performance and generalization. *Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation*.
16. Welker, S., Richter, J., & Gerkmann, T. (2022). Speech enhancement with score-based generative models in the complex STFT domain. arXiv:2203.17004.
17. Wang, Y., Ju, Z., Tan, X., He, L., Wu, Z., Bian, J., & Zhao, S. (2023). AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models. arXiv:2304.00830.
18. Dhyani, T., Lux, F., Mancusi, M., Fabbro, G., Hohl, F., & Vu, N. T. (2025). High-Resolution Speech Restoration with Latent Diffusion Model. arXiv:2409.11145.
19. Guimarães, H. R., Su, J., Kumar, R., Falk, T. H., & Jin, Z. (2025). DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers. arXiv:2504.09381.
20. Zhao, S., Pan, Z., Zhou, K., Ma, Y., Zhang, C., & Ma, B. (2025). Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning. arXiv:2501.10052.
21. Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. *Interspeech 2013*, 436–440.

6G ISAC: Expanding the Value of Mobile Networks Beyond Connectivity

Jaeyeon Shim|Zhongfeng Zhang|Yu Bin|Hyoungju Ji

Introduction: From Connectivity to Awareness

Mobile networks have traditionally been designed to connect people, devices, and services. Each generation has improved data rates, latency, reliability, coverage, and capacity. In 6G, however, the role of the network is expected to expand further. A 6G network may not only deliver information, but also help understand the physical environment in which communication takes place.

Integrated Sensing and Communication, or ISAC, is one of the key technologies behind this shift. The basic idea is simple: radio signals used for communication can also carry information about the surrounding environment. By analyzing reflections, propagation characteristics, and measurement results, the network may detect objects, track movement, or understand changes in the radio environment.

The important point is that ISAC is not just about adding radar-like functions to a base station. Its broader value lies in reusing communication infrastructure, radio resources, and signal processing capability to create a distributed sensing layer. If designed carefully, 6G ISAC can expand the value of mobile networks beyond connectivity without requiring a completely separate nationwide sensing infrastructure.

This is why ISAC is becoming an important topic in 6G. It connects several long-term directions: network intelligence, service expansion, coverage-aware operation, energy and deployment efficiency, and tighter integration between the physical and digital worlds.

What ISAC Means for 6G Networks

The value of ISAC starts from infrastructure reuse. Mobile networks already have wide-area deployment, radio coverage, synchronization, beamforming capability, and baseband processing resources. If sensing can be integrated into this existing communication framework, the network can provide sensing capability with a lower incremental deployment burden than building a separate sensing system from scratch.

From this perspective, ISAC can be understood as a new value layer on top of the communication network. A base station, a fixed wireless access device, a vehicle, or another network-connected node may contribute to sensing, depending on its capability and deployment condition. The sensing result may then be used by applications, by the network, or by other devices.

This also creates new business opportunities for 6G. Today, the main value of mobile networks is still largely tied to connectivity: data plans, broadband access, enterprise private networks, and device connections. ISAC can expand this value by enabling sensing-based services on top of the same infrastructure. For example, environmental awareness, road and traffic monitoring, industrial site monitoring, public safety support, indoor presence detection, and context-aware network operation may become new service opportunities if sensing information can be provided in a reliable and privacy-aware manner.

The business value of ISAC does not come from replacing dedicated sensors or radar systems in every scenario. Instead, it comes from enabling “good-enough and widely available” sensing capability where communication infrastructure already exists. This distinction is important. A dedicated sensing system may provide higher accuracy for a specific mission, but it often requires separate deployment, operation, and maintenance. ISAC can create value by offering scalable sensing coverage with lower additional deployment effort, especially for use cases that benefit from wide-area observation rather than extremely high sensing precision.

This can be particularly meaningful for operators and ecosystem partners. For operators, ISAC may provide a path to move beyond pure connectivity revenue and offer sensing-enabled services to enterprises, municipalities, mobility platforms, home service providers, and industrial customers. For device and infrastructure vendors, ISAC may create new requirements for sensing-capable radio design, reporting frameworks, edge processing, and service platforms. For application providers, it may open a new source of environmental and contextual information, provided that privacy, security, and data ownership are properly addressed.

This does not mean that every communication node should become a high-performance radar. A practical 6G ISAC design needs to respect the constraints of communication systems. Sensing should be implementable, as much as possible, using common RF and baseband modules. This principle is important because the real strength of ISAC is scale: useful sensing capability should be added without making the communication network significantly more complex or costly.

A dedicated radar system may provide strong sensing performance in a specific location. A communication network, on the other hand, can provide broad and continuous coverage if sensing is integrated efficiently. The design goal should therefore be to find a practical balance: useful sensing capability, reasonable system impact, acceptable cost for communication operation, and clear service value for the 6G ecosystem.

From 5G ISAC to 6G ISAC: Why the Scope Needs to Expand

In 5G-related ISAC studies, UAV detection has often been used as an intuitive and representative example. This makes sense. UAV detection is easy to understand, and it clearly shows how a communication network may be used to detect or track passive objects without deploying dedicated radar infrastructure.

UAV detection remains an important use case for 6G ISAC. However, if ISAC is discussed only through UAV detection, its value may look too narrow. The broader opportunity for 6G is environmental awareness. A 6G network should be able to understand not only whether a specific flying object exists, but also how the surrounding environment affects communication and services.

This includes buildings, roads, vehicles, crowds, blockage, reflection, clutter, and other environmental factors. A vehicle passing through a street, a crowd forming in a public space, a moving blockage in an indoor environment, or a change in surrounding reflectors may all affect radio links. ISAC can provide an additional source of information about these changes.

In other words, 6G ISAC should move beyond object detection alone. It should become a framework for sensing the environment around communication networks.

Key Use Cases for 6G ISAC

1. Environment and Background Sensing

A practical starting point for 6G ISAC is environment or background sensing. Instead of focusing only on a specific target, the network may collect information about the surrounding radio environment. This can include static objects such as buildings or walls, non-static objects such as vehicles or people, and environmental factors such as blockage, multipath, clutter, and interference.

This type of sensing can support both services and network operation. For example, background sensing may help identify traffic congestion, crowd density, or changes in the local environment. It may also help the network understand why a link is degraded: whether the cause is blockage, mobility, interference, or a change in the propagation environment.

From a communication-assistance perspective, this type of environmental information can be useful because it is directly related to channel behavior. Environmental sensing may provide information on multipath, blockage, fading, clutter, and interference. Such information may potentially help communication-related functions, but the benefit should not be overstated.

Sensing information does not automatically improve communication performance. How the network uses such information for scheduling, beam management, mobility, or link adaptation may depend on implementation and system design. For the early stage of 6G ISAC, a more realistic goal is to define what sensing information can be measured, how it can be reported, and what level of reliability can be expected.

This is why background sensing is important. It can become the foundation for many future ISAC services without forcing the system to assume one specific application from the beginning.

2. Vehicle, RSU, and Fixed-UE Sensing

UE-side sensing is another important direction for 6G ISAC. However, not all UEs are equally suitable for sensing. A handheld smartphone has strict constraints: limited antenna size, limited transmit power, battery constraints, mobility, and uncertain position or orientation. These factors can make sensing information less reliable, especially when accurate location or angle information is required.

This is why 6G ISAC should look beyond smartphones. Vehicles, road-side units, customer premises equipment, and other fixed or semi-fixed devices may be more attractive sensing nodes.

A vehicle, for example, has several advantages over a handheld device. It can support better antenna placement, has a more stable power supply, and can continuously observe its surroundings while moving through roads. Vehicle-mounted sensing may help monitor nearby objects, road congestion, surrounding buildings, local blockage, and traffic density. This information may be useful not only for the vehicle itself, but also for broader network or service-level awareness.

Road-side units and fixed UEs can provide even more stable sensing information. Their location and orientation can be known or controlled, which improves the confidence of sensing results. This distinction is critical for 6G ISAC. “UE sensing” should not be treated as a single category. A smartphone, a vehicle, an RSU, and a fixed indoor device have different antenna capabilities, power budgets, processing assumptions, mobility patterns, and reporting feasibility. A practical 6G ISAC framework should define these assumptions clearly.

3. Indoor CPE-Based Sensing

Indoor fixed devices, such as CPEs used for fixed wireless access, may also become meaningful sensing nodes in 6G. Unlike handheld devices, CPEs are usually installed in a fixed location, connected to a stable power source, and may have a larger form factor. These characteristics make them more suitable for continuous sensing than battery-limited mobile devices.

A fixed CPE can act as a stable observation point in an indoor environment. Potential home-oriented use cases may include presence detection, indoor environment awareness, safety monitoring, or detection of unusual movement patterns. These examples should be treated carefully, because indoor sensing is closely related to privacy.

A useful ISAC system should not simply collect as much information as possible. It should define what is measured, what is reported, where the sensing function is processed, and how sensitive information is protected.

For this reason, CPE-based sensing may become a good example of the balance required in 6G ISAC. The device characteristics are favorable, the use cases are understandable, and the deployment model is practical. At the same time, privacy, reporting overhead, and processing responsibility must be considered from the beginning.

4. Communication Assistance

Communication assistance is another important use case for ISAC, but it requires careful framing. The intuitive idea is that if the network understands the environment better, it may operate communication links more efficiently. For example, sensing information may help identify blockage, moving objects, reflectors, or changes in the propagation environment.

This information could potentially support beam management, CSI acquisition, mobility handling, or interference management. However, it is not yet sufficient to assume that sensing will directly guarantee communication performance improvement in all scenarios.

A practical approach is to separate two questions. The first question is: what sensing information can be measured and reported reliably? The second question is: how should the network use that information to improve communication? In the early stage of 6G ISAC, the first question should come before the second.

This approach avoids overloading ISAC with too many communication-performance assumptions, while still allowing sensing information to become useful for future network optimization.

What 6G ISAC Needs to Address

For ISAC to become a practical 6G feature, it needs to focus on integration with communication from day one. The key question is not how to design the best standalone sensing waveform in isolation. The key question is how to support useful sensing capability within a communication system that must also serve users, manage interference, control overhead, and remain implementable.

Start from Communication-Compatible Design

6G ISAC should start from the communication framework. CP-OFDM-based waveform generation, a common frame structure, and reuse of communication reference signals (RSs) should be evaluated before introducing sensing-specific designs. For operation based on a sensing RS, the signal characteristics may be treated in a similar manner to communication RSs, since they are mainly determined by sequence design and resource mapping.

This does not mean that no enhancement is possible. If a clear sensing performance gap is identified, enhancements to reference signal design and overall procedures may be considered, including sequence design, resource mapping patterns, repetition structure for possible range extension, resource configuration, or measurement reporting procedures. For example, ZC-based sequence design or frequency domain mapping may be considered within the CP-OFDM framework if existing communication RSs are not sufficient for sensing. But these enhancements should be justified by realistic evaluation and should preserve communication compatibility as much as possible.

The reason is straightforward. ISAC is valuable because it is integrated with communication. If sensing requires a separate waveform, separate frame structure, or separate hardware assumption from the beginning, the deployment benefit becomes much weaker. A communication-compatible baseline keeps the system scalable and implementable. With this baseline, necessary enhancements may be achieved by optimizing the design of communication signals within the communication framework. Such an approach can allow the network to provide meaningful sensing capabilities without requiring hardware modifications to the communication transceiver.
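
To make the ZC-based sequence design mentioned above more concrete, the following is a minimal sketch of a Zadoff-Chu (ZC) sequence using the standard textbook definition for odd lengths. The root index (25) and length (139) are arbitrary illustrative values, not parameters proposed in this post. The sketch only shows the two properties that tend to make such sequences attractive for delay estimation: constant amplitude and zero cyclic autocorrelation at non-zero lags.

```python
import cmath
from math import gcd


def zadoff_chu(root: int, length: int) -> list[complex]:
    """Return a Zadoff-Chu sequence of odd length with the given root index."""
    if length % 2 == 0 or gcd(root, length) != 1:
        raise ValueError("this sketch assumes odd length and gcd(root, length) == 1")
    seq = []
    for n in range(length):
        # Reduce the phase index modulo 2*length to keep the argument small.
        k = (root * n * (n + 1)) % (2 * length)
        seq.append(cmath.exp(-1j * cmath.pi * k / length))
    return seq


def cyclic_autocorrelation(seq: list[complex], lag: int) -> complex:
    """Correlate the sequence with a cyclically shifted copy of itself."""
    size = len(seq)
    return sum(seq[n] * seq[(n + lag) % size].conjugate() for n in range(size))


zc = zadoff_chu(root=25, length=139)        # hypothetical parameters
peak = abs(cyclic_autocorrelation(zc, 0))   # equals the sequence length
sidelobe = max(abs(cyclic_autocorrelation(zc, lag)) for lag in range(1, len(zc)))
print(f"peak={peak:.1f}, largest sidelobe={sidelobe:.2e}")
```

The peak equals the sequence length while the sidelobes vanish up to floating-point error. How such a sequence would be mapped onto CP-OFDM resources is a separate design question that this sketch does not address.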

Clarify Sensing Modes and Node Assumptions

6G ISAC also needs clear assumptions on sensing modes and node roles. Monostatic, bistatic, and multistatic sensing have different technical implications. The transmitter, receiver, sensing function, and reporting node may be the same or different. Consequently, requirements for synchronization, beam operation, interference, and reporting paths also vary depending on the sensing mode.

For BS-centric sensing, standard constraints, such as the need for a standardized waveform, should be carefully assessed. For BS monostatic sensing, introducing specification constraints can prevent vendor-specific optimizations needed to handle interference across different deployment scenarios. For BS-BS bistatic sensing, the requirement for stringent synchronization presents practical challenges when implemented across different infrastructure vendors via inter-vendor collaboration. Therefore, relying on vendor implementation might be a better approach for those sensing modes than imposing standard constraints.

On the other hand, for UE-involved sensing, the assumptions become even more important. Various UE types, such as handheld UE, vehicle, fixed CPE, and RSU, should not be modeled in the same way. Their antenna capability, transmit power, processing capability, mobility, and reporting feasibility are different. Furthermore, when both BS and UE are involved (e.g., BS-UE or UE-BS bistatic sensing), the impact on UE operation and the integration with the communication framework become more significant.

Therefore, before defining detailed signal designs, 6G ISAC needs to clarify who transmits, who receives, who processes, and who reports sensing information. Understanding these limitations and device variations can help 6G standardization focus on essential standard impacts, while leaving deployment flexibility to implementation.
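
The four roles in this section (who transmits, who receives, who processes, and who reports) can be captured in a small data model. The sketch below is only an illustration: the class names, node names, and the simplified rule that more than one receiver means multistatic are assumptions made for this example, not definitions from any specification.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    BS = "base station"
    HANDHELD_UE = "handheld UE"
    VEHICLE = "vehicle"
    FIXED_CPE = "fixed CPE"
    RSU = "road-side unit"


@dataclass(frozen=True)
class Node:
    name: str
    node_type: NodeType


@dataclass(frozen=True)
class SensingSession:
    transmitter: Node
    receivers: tuple[Node, ...]
    processor: Node   # where the sensing function runs
    reporter: Node    # which node reports the sensing result

    def __post_init__(self) -> None:
        if not self.receivers:
            raise ValueError("a sensing session needs at least one receiver")

    def mode(self) -> str:
        """Classify the session by the transmitter/receiver relationship."""
        if len(self.receivers) > 1:
            return "multistatic"   # simplified: several receivers
        same_node = self.receivers[0] == self.transmitter
        return "monostatic" if same_node else "bistatic"

    def involves_ue(self) -> bool:
        """True if any transmitting or receiving node is not a base station."""
        nodes = (self.transmitter, *self.receivers)
        return any(n.node_type is not NodeType.BS for n in nodes)


bs1 = Node("bs-1", NodeType.BS)            # hypothetical nodes
cpe = Node("cpe-7", NodeType.FIXED_CPE)
echo = SensingSession(bs1, (bs1,), processor=bs1, reporter=bs1)
downlink = SensingSession(bs1, (cpe,), processor=bs1, reporter=cpe)
print(echo.mode(), echo.involves_ue())
print(downlink.mode(), downlink.involves_ue())
```

Keeping the processor and reporter separate from the transmitter and receiver makes it explicit that, for example, a CPE may receive and report while the network processes the result.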

Build a Measurement and Reporting Framework

6G ISAC also needs a practical measurement and reporting framework. Sensing results can have very different forms. They may be raw samples, delay-Doppler-angle profiles, detected peaks, or processed target-level metrics. More detailed information may improve sensing quality, but it also increases reporting overhead, processing burden, and privacy risk.

This is especially important for UE-side sensing. If a UE needs to report large sensing data frequently, the overhead may become too high. If the UE processes too much information locally, device complexity and power consumption may increase. Depending on the reporting form, additional post-processing may be required at the BS side. In such cases, additional contextual information, such as sensitive location or environment information, may also be requested to interpret the sensing report. If such information is included in the report, privacy concerns become significant.

The right design point may differ by use case. A vehicle monitoring road congestion may need frequent but compact reports. A fixed indoor CPE may require stronger privacy protection and local processing. A TRP-side sensing use case may allow more network-side processing but may require stronger coordination across cells. 6G ISAC needs to define flexible but manageable reporting mechanisms that reflect these differences.
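
The trade-off between report detail and overhead described in this section can be illustrated with simple arithmetic. All dimensions in the sketch below (number of subcarriers, bins, peaks, targets, bit widths, and the reporting rate) are hypothetical values chosen only to show the relative scale of the four report forms. They are not measured or proposed figures.

```python
def payload_bytes(num_values: int, bits_per_value: int) -> int:
    """Size of one report, rounded up to whole bytes."""
    return (num_values * bits_per_value + 7) // 8


# All dimensions below are hypothetical and only illustrate relative scale.
REPORT_FORMS = {
    # subcarriers x OFDM symbols x RX antennas x (I, Q), 16 bits each
    "raw samples": payload_bytes(4096 * 14 * 4 * 2, 16),
    # delay bins x Doppler bins x angle bins, 16-bit power per bin
    "delay-Doppler-angle profile": payload_bytes(256 * 64 * 8, 16),
    # 32 peaks x (delay, Doppler, angle, power), 16 bits each
    "detected peaks": payload_bytes(32 * 4, 16),
    # 4 targets x (range, velocity, angle), 32 bits each
    "target-level metrics": payload_bytes(4 * 3, 32),
}


def reporting_rate_kbps(report_size_bytes: int, reports_per_second: float) -> float:
    """Average reporting load in kbit/s for a given report size and rate."""
    return report_size_bytes * 8 * reports_per_second / 1000.0


for form, size in REPORT_FORMS.items():
    rate = reporting_rate_kbps(size, reports_per_second=10)
    print(f"{form:30s} {size:>8d} B/report {rate:>10.1f} kbit/s")
```

Under these assumed dimensions, the report size spans several orders of magnitude between raw samples and target-level metrics, which is why the reporting form needs to be chosen per use case.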

Evaluate Interference Realistically

6G ISAC needs realistic interference modeling from the beginning. Sensing signals do not exist in an empty environment. They coexist with communication traffic, neighboring cells, other sensing operations, and reflections from background objects.

This is particularly important for multi-node or cooperative sensing. For example, when BS monostatic sensing is performed simultaneously in different cells, a sensing signal transmitted by one BS may act as interference to echo reception at another BS. Similarly, different cells may operate the same or different sensing modes. In UE-involved sensing, UE transmission for sensing may also create new interference and coexistence conditions depending on how sensing reference signals are transmitted, received and reported.

Without interference-aware evaluation, sensing performance may look promising in simulation but become fragile in real deployment. 6G ISAC therefore needs to evaluate not only ideal sensing accuracy, but also sensing reliability under realistic intra-cell, inter-cell, and intra-node interference conditions.

Balance Sensing Gain and System Cost

Finally, 6G ISAC needs to balance sensing gain against communication impact and implementation cost. A sensing-specific design may improve a particular sensing metric, but it may also increase resource overhead, reduce scheduling flexibility, require additional RF capability, or increase UE power consumption.

This trade-off is especially important because ISAC is intended to be integrated with communication. A design that works well for sensing alone may not be suitable if it consumes too many communication resources or creates too much device complexity.

Therefore, 6G ISAC should evaluate candidate designs across three dimensions at the same time: sensing performance, communication impact, and implementation feasibility. Only when all three are considered together can ISAC become practical for commercial networks.

Conclusion

ISAC is one of the technologies that can expand the role of 6G networks beyond connectivity. Its value is not limited to detecting a specific object such as a UAV. The broader opportunity is to make mobile networks more aware of the physical environment around them.

For this reason, 6G ISAC should focus on practical and scalable use cases: environment and background sensing, vehicle and fixed-UE sensing, indoor CPE-based sensing, and carefully scoped communication assistance. These use cases can build on the existing strength of mobile networks: broad coverage, distributed infrastructure, radio resources, and signal processing capability.

At the same time, ISAC must be designed with realistic constraints. Sensing should be integrated with communication, not isolated from it. CP-OFDM-based waveform generation, common frame structure, communication reference signal reuse, clear UE assumptions, compact reporting, privacy awareness, and interference-aware evaluation should be considered from the beginning.

The key message is simple: 6G ISAC should not be about turning every base station into a standalone radar. It should be about making the communication network aware of its environment in a practical, scalable, and deployable way.

Practical AI-Driven Traffic Classification for Next Gen Service-Aware RAN

Sunhyun Kim|Sangho Lee|Daeun Ko|Jeonga Lim|Hyungwoo Ku

1. Introduction

Modern mobile networks are carrying increasingly diverse types of traffic, including video streaming, cloud gaming, Extended Reality (XR) / Virtual Reality (VR) services, and emerging AI-driven applications. Each service has its own traffic characteristics and quality-of-service (QoS) requirements, making real-time analysis of traffic behavior essential for efficient Radio Access Network (RAN) operation. Traffic classification allows the network to identify service characteristics and optimize network control accordingly. For example, service awareness based on traffic pattern information enables the RAN to adaptively control the Radio Resource Control (RRC) state. This approach can improve radio resource allocation, optimize QoS management, reduce unnecessary signaling, and enhance UE power efficiency.

With the growing diversity of mobile services and increasingly dynamic network conditions, network optimization can no longer rely solely on static configurations and predefined policies. This requires the RAN to continuously understand traffic behavior and adapt its operation accordingly. As networks evolve toward 6G, such service-aware operation is becoming an essential component of intelligent and autonomous RAN systems.

Recently, AI-RAN has emerged as a key direction for enabling self-aware and adaptive network operation [1]. By integrating artificial intelligence (AI) into the RAN, networks can dynamically analyze traffic patterns and adaptively optimize network behavior without relying on static rule-based policies. In particular, AI-based traffic classification has garnered significant attention because it can infer traffic characteristics even in encrypted traffic using statistical and temporal traffic features [2][3].

However, deploying AI-driven traffic classification in RAN systems presents significant challenges. Current methods typically depend on computationally intensive deep learning models or large-scale manually labeled datasets [4], rendering them unsuitable for real-time RAN environments with stringent latency and resource limitations. Furthermore, the continuous collection and maintenance of labeled traffic datasets become increasingly impractical as new applications and traffic patterns rapidly emerge.

To address these challenges, this blog introduces a practical AI-driven traffic classification framework for service-aware RAN operations. The proposed scheme combines clustering-based pseudo-labeling with lightweight flow-level inference to enable practical deployment in the Central Unit (CU). Furthermore, the classification results are utilized for adaptive RRC state control to reduce UE power consumption while minimizing signaling overhead. Experimental results in a testbed demonstrate the feasibility and effectiveness of the proposed approach. Beyond RRC optimization, the proposed framework can also be extended to various service-aware RAN functions that require real-time traffic understanding in future AI-native 6G networks.

2. Technical Challenges

Traffic classification systems in RAN environments encounter challenges at both the data processing and the model deployment levels. Achieving accurate, real-time classification while maintaining low computational overhead demands carefully designed algorithms and system architectures. Below we identify challenges in more detail.

Data Processing Level:

Feature Extraction while maintaining User Plane Throughput: Extracting meaningful features from raw packet data in real-time without degrading the throughput is a key requirement. Efficient processing pipelines are required to minimize overhead while ensuring real-time operation.

Temporal Dynamics: Traffic patterns exhibit complex temporal dependencies that differ across applications. Capturing these patterns requires carefully chosen measurement windows and effective feature engineering.

Label Scarcity: Obtaining accurate labels for training supervised models is costly and time-consuming. The rapid emergence of new applications and services further intensifies this challenge.

Model Deployment Level:

Computational Budget: The CU has limited computational resources shared among various functions. Traffic classification models must operate efficiently within strict CPU and memory constraints.

Inference Latency: Classification decisions must be made quickly enough to support RRC state control. The entire pipeline, from packet reception to classification output, must complete within a time budget to be applicable for RRC state control.

Model Adaptability: Traffic characteristics evolve as applications update and new services emerge. The classification system must adapt to these changes without relying on frequent retraining with labeled data.

3. The Proposed Scheme

To address these technical challenges, we propose a lightweight AI-driven traffic classification framework designed for practical deployment in the CU. This approach integrates clustering-based pseudo-labeling and lightweight flow-level inference to enable real-time, service-aware RAN optimization.

3.1 System Overview

The overall architecture of the proposed scheme consists of two main stages as illustrated in Figure 1:

Figure 1. Overall architecture

Model Training Stage: Traffic data is first collected from the CU, capturing packet-level information including timestamps, packet sizes, and flow identifiers (5-tuple information: source IP, destination IP, source port, destination port, protocol). Clustering algorithms are then applied to generate labeled training data. A traffic classification model is trained using this data and subsequently deployed to the CU for RRC state control.

Model Inference and RRC State Control Stage: The deployed model performs real-time traffic classification on incoming flows and dynamically adjusts RRC state control parameters based on the classified traffic type.

This design enables a data-driven pipeline that eliminates the need for manual labeling and supports real-time deployment in the CU. The separation of training and inference stages allows model updates to be performed without disrupting ongoing operations.

3.2 Clustering-based Labeling

Since manual labeling is costly and does not scale well with increasing traffic diversity, we adopt an unsupervised approach using clustering. The labeling process consists of the following steps:

Step 1 - Data Collection: Network traffic data is collected from the CU, capturing packet-level information including timestamps, packet sizes, and flow identifiers.

Step 2 - Clustering: Flows are grouped based on statistical features using an unsupervised clustering algorithm. The chosen clustering algorithm is particularly suitable for this task because it does not require pre-specifying the number of clusters, can identify clusters of arbitrary shape, and naturally handles noise and outliers.

Step 3 - Flow Selection: Clusters corresponding to target services are identified based on cluster characteristics such as average packet size, inter-arrival time distribution, and temporal patterns. Representative flows from selected clusters are assigned labels (e.g., video), while other flows are labeled as other traffic. To improve label quality, flow selection considers temporal information - specifically, flows that do not overlap in time are greedily selected to avoid redundant sampling.
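
The clustering in Step 2 can be sketched as follows. The post does not name the algorithm. DBSCAN is used here purely as an assumption, because it is one well-known algorithm with the three listed properties (no preset cluster count, arbitrary cluster shapes, explicit noise handling). The feature vectors, `eps`, and `min_pts` are hypothetical, and the features are assumed to be normalized to a comparable scale beforehand.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE (-1) for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE           # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical normalized features: (mean packet size, mean inter-arrival time)
flows = [
    (0.90, 0.10), (0.92, 0.12), (0.88, 0.11), (0.91, 0.09),  # large packets, short gaps
    (0.20, 0.80), (0.22, 0.82), (0.18, 0.79), (0.21, 0.81),  # small packets, long gaps
    (0.55, 0.45),                                            # isolated flow
]
labels = dbscan(flows, eps=0.1, min_pts=3)
print(labels)
```

With these toy values the two dense groups form separate clusters and the isolated flow is marked as noise, which mirrors how non-representative flows could be excluded before labels are assigned.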

Figure 2. Clustering-based labeling process
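
The greedy selection of time-disjoint flows in Step 3 can be sketched as a classic interval-scheduling pass. The post only states that non-overlapping flows are selected greedily. Ordering candidates by end time is an assumption of this sketch, as are the flow identifiers and timestamps.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    flow_id: str
    start: float   # seconds
    end: float     # seconds


def select_non_overlapping(flows: list[Flow]) -> list[Flow]:
    """Greedily keep flows that do not overlap in time with earlier picks."""
    selected = []
    last_end = float("-inf")
    # Assumed ordering: earliest-finishing flow first.
    for flow in sorted(flows, key=lambda f: f.end):
        if flow.start >= last_end:
            selected.append(flow)
            last_end = flow.end
    return selected


candidates = [   # hypothetical flows from one selected cluster
    Flow("a", 0.0, 10.0),
    Flow("b", 5.0, 12.0),
    Flow("c", 10.0, 20.0),
    Flow("d", 18.0, 25.0),
    Flow("e", 20.0, 30.0),
]
picked = select_non_overlapping(candidates)
print([f.flow_id for f in picked])
```

Here flows "b" and "d" are dropped because they overlap with flows that were already selected, so the labeled set does not sample the same time period twice.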

3.3 Traffic Classification

Traffic is measured by collecting packet data within a defined measurement window, which is then segmented into flows based on 5-tuple information. Statistical features of downlink and uplink packets are extracted to capture temporal traffic patterns. The labeled dataset is used to train a lightweight flow-level classification model capable of distinguishing between:

Video on Demand (VoD): Characterized by intermittent burst patterns

Live Streaming: Characterized by continuous traffic flow

Other Traffic: Represents other traffic types that do not align with the above categories.
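
The flow segmentation and per-direction feature extraction described at the start of this section can be sketched as below. The `Packet` fields, the direction-independent flow key, and the particular statistics (packet count, mean size, mean and maximum inter-arrival gap) are illustrative assumptions, since the post does not list the exact feature set. The addresses are documentation examples.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class Packet(NamedTuple):
    timestamp: float   # seconds
    size: int          # bytes
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    downlink: bool     # True if the packet travels toward the UE


def flow_key(p: Packet) -> tuple:
    """Direction-independent 5-tuple, so both directions map to one flow."""
    ends = sorted([(p.src_ip, p.src_port), (p.dst_ip, p.dst_port)])
    return (ends[0], ends[1], p.protocol)


def direction_stats(packets: list[Packet]) -> dict:
    """Simple per-direction statistics (an illustrative feature set)."""
    times = sorted(p.timestamp for p in packets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "count": len(packets),
        "mean_size": mean(p.size for p in packets) if packets else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
    }


def extract_features(window: list[Packet]) -> dict:
    """Group one measurement window into flows and compute DL/UL features."""
    flows = defaultdict(list)
    for p in window:
        flows[flow_key(p)].append(p)
    features = {}
    for key, pkts in flows.items():
        dl = direction_stats([p for p in pkts if p.downlink])
        ul = direction_stats([p for p in pkts if not p.downlink])
        features[key] = {**{f"dl_{k}": v for k, v in dl.items()},
                         **{f"ul_{k}": v for k, v in ul.items()}}
    return features


def dl_pkt(t: float, size: int) -> Packet:
    return Packet(t, size, "203.0.113.5", "10.0.0.2", 443, 40000, "TCP", True)


def ul_pkt(t: float, size: int) -> Packet:
    return Packet(t, size, "10.0.0.2", "203.0.113.5", 40000, 443, "TCP", False)


window = [dl_pkt(0.0, 1400), ul_pkt(0.25, 100), dl_pkt(0.5, 1400), dl_pkt(1.0, 1400)]
features = extract_features(window)
print(features)
```

A large maximum downlink gap relative to the mean gap would hint at the intermittent bursts attributed to VoD above, while a small, steady gap would be closer to the continuous pattern of live streaming.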

3.4 Adaptive RRC State Control

Different traffic types exhibit distinct patterns, which can be leveraged to optimize RRC state control. The key insight is that the RRC release timing can be dynamically adjusted based on the classified traffic type. As shown in Figure 3, when traffic is classified as VoD with intermittent bursts, a shorter inactivity timer allows the UE to quickly transition to idle state during gaps, thereby saving power. For continuous traffic like live streaming, a longer timer prevents unnecessary frequent state transitions, ensuring efficient network operation. Therefore, this approach enhances UE power efficiency while minimizing the increase in signaling overhead.
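
A minimal sketch of this timer selection is shown below. The timer values are hypothetical and chosen only to reflect the ordering described above (short for VoD, long for live streaming). They are not the values used in the experiments. The rule of taking the longest timer when a UE has several active flows is likewise an assumption of this sketch.

```python
from enum import Enum


class TrafficClass(Enum):
    VOD = "video on demand"
    LIVE = "live streaming"
    OTHER = "other traffic"


# Hypothetical timer values in seconds, chosen only to show the ordering
# (short for bursty VoD, long for continuous live streaming).
INACTIVITY_TIMER_S = {
    TrafficClass.VOD: 2.0,
    TrafficClass.LIVE: 20.0,
    TrafficClass.OTHER: 10.0,
}


def select_inactivity_timer(active_classes: set[TrafficClass]) -> float:
    """Use the longest timer among a UE's active flows (an assumed policy)."""
    if not active_classes:
        return INACTIVITY_TIMER_S[TrafficClass.OTHER]
    return max(INACTIVITY_TIMER_S[c] for c in active_classes)


def should_release(idle_seconds: float, active_classes: set[TrafficClass]) -> bool:
    """True once the UE has been idle for at least the selected timer."""
    return idle_seconds >= select_inactivity_timer(active_classes)


print(should_release(3.0, {TrafficClass.VOD}))    # idle gap exceeds the short VoD timer
print(should_release(3.0, {TrafficClass.LIVE}))   # live streaming keeps the connection
```

Choosing the longest timer for mixed traffic is a conservative choice: it avoids releasing a connection that a continuous flow still needs, at the cost of some of the power saving.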

Figure 3. Adaptive RRC state control based on traffic classification showing different timer values for VoD and live streaming

Table 1. Adaptive inactivity timer based on traffic type

4. Implementation Results

4.1 Experimental Environment

Experiments were conducted in an in-house testbed using a commercial smartphone and base station. The detailed setup is shown in Table 2:

Table 2. Experimental environment specification

The core network and CU/DU are built on a commercially deployed CU package, reflecting practical deployment conditions. Traffic data was collected using YouTube applications on the smartphone, including both VoD and live streaming sessions, along with background traffic.

4.2 Traffic Classification Performance

Overall, the results show reliable classification performance, effectively distinguishing between target and non-target traffic. The classification accuracy of 98% demonstrates that the clustering-based labeling approach produces high-quality training data.

The UE power consumption and the classification accuracy of the proposed scheme were measured through repeated tests of the same video using a power monitoring tool in a controlled environment.

By applying adaptive RRC state control, the proposed scheme allows the UE to transition to the idle state more frequently. Despite the power consumption associated with VoD playback, the results validate that optimized RRC control is effective in reducing UE power consumption while maintaining minimal signaling overhead. The observed smartphone power saving of over 2%, though seemingly modest, translates to a significant extension in battery life when accumulated over typical daily usage patterns.

5. Conclusion

This blog outlines the motivation and concept of AI-driven traffic classification for RRC state control in 6G systems. As a practical solution addressing the challenges of real-time traffic classification in RAN environments, we have developed a traffic classification and RRC state control framework that incorporates clustering-based labeling and lightweight inference. Its classification accuracy, inference latency, and power-saving benefits have been validated through extensive experiments in an in-house testbed with a commercial setup.

Key Contributions:

Self-aware Traffic Understanding: RAN automatically identifies traffic characteristics without manual intervention, enabling intelligent resource management.

Adaptive RRC State Control: Reduces UE power consumption while keeping the increase in signaling overhead minimal, achieving measurable power savings in our experiments.

Practical Deployment: Demonstrated high classification accuracy and performance in an in-house testbed with a commercial setup.

Future work will focus on expanding the traffic classification to support more traffic types, improving the clustering-based labeling with semi-supervised learning approaches, and integrating the framework with other RAN optimization functions that improve the actual quality of experience for each traffic type.

References

[1] AI-RAN Alliance. "AI-RAN Alliance Vision and Mission White Paper." 2025.
[2] Aceto, Giuseppe, et al. "Mobile encrypted traffic classification using deep learning: Experimental evaluation, lessons learned, and challenges." IEEE Transactions on Network and Service Management 16.2 (2019): 445-458.
[3] Shapira, Tal, and Yuval Shavitt. "FlowPic: A generic representation for encrypted traffic classification and applications identification." IEEE Transactions on Network and Service Management 18.2 (2021): 1218-1232.
[4] Lin, Xinjie, et al. "ET-BERT: A contextualized datagram representation with pre-training transformers for encrypted traffic classification." Proceedings of the ACM Web Conference 2022. 2022.

From Modules to Agents: An Automatic AI Inference Optimization Compiler for 5G RAN

Jewon Jung|Daehan Kim|Youhwan Seol|Youngki Hong|Hoejoo Lee

1. Introduction

The 5G Radio Access Network (RAN) has rapidly become a fertile ground for artificial intelligence. Across both the physical layer (L1, PHY) and the data link layer (L2), AI models are increasingly being deployed to handle tasks once reserved for hand-engineered signal processing algorithms [1]. 3GPP has approved work items for AI/ML over the air interface, with AI-native transceiver technologies positioned as a key differentiator for 6G systems [2].

Yet running AI models inside a 5G base station is fundamentally different from running them in a cloud or edge inference server. The pipeline of a 5G gNodeB (gNB) operates under hard real-time constraints, and any AI module embedded in the receive or transmit chain must complete its inference within that budget while sharing the CPU with the rest of the signal processing stack.

General-purpose inference compilers such as TVM [3] or OpenVINO [4] are designed for very different deployment profiles; their overheads and code generation strategies rarely align with the tight latency budgets of an in-line RAN component. Closing this gap typically requires hand-written C++ kernels tuned for AVX-512 [5] Single Instruction Multiple Data (SIMD) instructions, a process that is slow, expert-intensive, and brittle as both AI architectures and target hardware evolve.

This post documents our effort to automate that gap. We have been developing a hardware-aware AI inference compiler that translates trained AI models into AVX-512 SIMD-optimized C++ kernels for direct integration into 5G RAN signal processing pipelines. The work has progressed in two stages: first a module-based compiler with a deterministic pipeline of analysis, optimization, and code generation modules; then a multi-agent compiler in which LLM-driven agents take over the roles previously hard-coded into modules.

The remainder of this post is organized as follows: Section 2 motivates inference optimization in the 5G RAN context; Section 3 describes the module-based compiler; Section 4 reflects on its achievements and limitations; Section 5 introduces the multi-agent compiler and presents results; and Section 6 concludes.

2. The Case for an Automatic AI Inference Optimization Compiler

A modern 5G gNB processes radio signals through a layered protocol stack. The physical layer and the data link layer handle the most computationally intensive tasks — channel estimation, equalization, modulation, and scheduling — all of which must execute within the time boundary of a single transmission slot. AI models inserted into this pipeline inherit the same deadline. There is no grace period: a model that computes its output 50 microseconds late does not degrade gracefully — it breaks the pipeline entirely.

Modern server-class CPUs, including those used in commercial 5G Distributed Units (DUs), expose a powerful class of instructions called SIMD. Intel’s AVX-512 operates on 512 bits of data in a single instruction, which allows sixteen 32-bit floating-point values to be processed at once. For AI workloads dominated by matrix multiplications and convolutions, this can translate to throughput gains of up to an order of magnitude over scalar code. However, exploiting AVX-512 effectively is not automatic: data must be aligned to 64-byte boundaries, computation tiles must fit inside L1 or L2 cache to avoid costly main-memory accesses, and instruction sequences must be carefully ordered to keep all execution units occupied. Hardware-aware optimization, therefore, means knowing exactly which registers and cache levels a given layer’s tensors occupy, and tiling the computation accordingly.
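The arithmetic behind these constraints can be sketched in a few lines of Python. This is only an illustration: the 48 KiB L1 data cache size and the tile shapes are assumed values, not figures from our target platform, and the function names are hypothetical.

```python
# Illustrative arithmetic only; cache size and tile shapes are assumed values.
VECTOR_BITS = 512                    # width of one AVX-512 register
FLOAT32_BITS = 32
ALIGNMENT_BYTES = VECTOR_BITS // 8   # 64-byte alignment for full-width loads
L1D_BYTES = 48 * 1024                # assumed L1 data cache size


def lanes_per_vector(element_bits: int = FLOAT32_BITS) -> int:
    """Number of elements that fit in one 512-bit register."""
    return VECTOR_BITS // element_bits


def is_aligned(address: int, alignment: int = ALIGNMENT_BYTES) -> bool:
    """True if a buffer address sits on the required boundary."""
    return address % alignment == 0


def tile_bytes(rows: int, cols: int, inner: int, element_bytes: int = 4) -> int:
    """Working set of one matmul tile: A (rows x inner), B (inner x cols), C (rows x cols)."""
    return element_bytes * (rows * inner + inner * cols + rows * cols)


def fits_in_cache(rows: int, cols: int, inner: int, cache_bytes: int) -> bool:
    """True if the whole tile working set fits in the given cache level."""
    return tile_bytes(rows, cols, inner) <= cache_bytes


print(lanes_per_vector())                        # 16 float32 lanes
print(tile_bytes(32, 32, 64))                    # 20480 bytes
print(fits_in_cache(32, 32, 64, L1D_BYTES))      # True
print(fits_in_cache(128, 128, 128, L1D_BYTES))   # False
```

A 32 x 32 x 64 float32 tile needs 20 KiB and fits in the assumed L1, while a 128 x 128 x 128 tile needs 192 KiB and would spill to a lower cache level.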

Given these requirements, a natural question is whether existing approaches can fill the gap. Historically, the highest-performing AI inference code in latency-critical environments has been written entirely by hand: optimization experts study the target hardware, translate the model into C++, and iterate experimentally until the fastest implementation emerges. This was the path we ourselves took initially, and it produces excellent code, but it scales poorly with the diversity of 5G RAN AI models and the pace at which target hardware evolves.

The alternative is a general-purpose AI compiler such as TVM or OpenVINO. However, these are designed around a fundamentally different set of assumptions: the inference runtime operates as an isolated process, and the latency target is loose rather than bound to a 5G RAN slot.

Figure 1. Existing approaches to AI inference optimization either fail to scale (manual hand-tuning) or fail to meet the latency budget of 5G RAN (general-purpose compilers), motivating our automatic AI inference optimization compiler.

This gap — between what general-purpose AI compilers provide and what a real-time 5G RAN pipeline demands — is what motivated us to build our own hardware-aware AI inference compiler from the ground up. As Fig. 1 illustrates, existing approaches either fail to scale (manual hand-tuning) or fail to meet the slot-level latency budget (general-purpose compilers), leaving an unmet need that our compiler is designed to fill. Rather than adapting an existing compiler to a new deployment target, we chose to generate AVX-512 SIMD C++ kernels tailored to the exact microarchitecture and cache topology of the target DU platform.

3. First Generation: A Module-Based Compiler

Our first attempt at automating the kernel generation process took the form of a deterministic, module-based compiler structured as a clean three-stage pipeline — parse, optimize, generate — that mirrored the established design pattern of classical AI compilers, adapted to the demands of 5G RAN deployment. The objective was twofold: to capture the manual optimization workflow in a reproducible software form, and to do so without introducing any runtime dependency. Every model that entered the compiler had to leave as a self-contained C++ source and object file that could be linked by a standard toolchain and dropped directly into the DU codebase.

Figure 2. The module-based compiler operates as a deterministic three-stage pipeline. Each module hands off a structured artifact (Candidate Graphs, Kernel List) to the next, with all final choices grounded in measurement.

Fig. 2 illustrates the overall architecture. The Model Parser normalizes the input model and prepares a structured representation that exposes optimization opportunities for downstream stages. The Optimizer then evaluates candidate kernel implementations on the target hardware and selects the fastest realization for each operator. Finally, the Code Generator materializes the resulting design as complete C++ code. The design makes a deliberate trade-off — empirical evaluation over heuristic search — in exchange for output that is reproducible, inspectable, and free of runtime surprises.

3.1 Model Parser

The Model Parser is the entry point of the compiler. It accepts a trained model — supplied in a common deep learning framework format — together with the input tensor shape and a description of the target processor, and converts this into a structured representation that downstream stages can consume deterministically. Internally, the Parser performs this work in two phases.

The first phase normalizes the input model into a canonical graph representation, eliminating framework-specific code paths and stripping away artifacts left over from training or export so that every remaining node corresponds to a real, performance-relevant computation. The second phase enumerates candidate graph variants by applying a predefined set of operator fusion rules in different combinations, and annotates each candidate with the per-node metadata that downstream stages will need. Rather than committing to a single fused graph through a heuristic, the Model Parser produces the full set of candidates so that the final choice can be settled by measurement later in the pipeline.
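As a rough illustration of the second phase, the following Python sketch enumerates candidate graphs by applying every subset of a rule set to a linear chain of operators. The rule names, operator names, and the list-based graph representation are hypothetical simplifications chosen for this example, not the compiler's actual data structures.

```python
from itertools import combinations

# Hypothetical fusion rules: an adjacent operator pair and its fused replacement.
FUSION_RULES = {
    "conv_relu": (("Conv2D", "ReLU"), "Conv2D+ReLU"),
    "fc_relu": (("FullyConnected", "ReLU"), "FullyConnected+ReLU"),
}


def apply_rule(ops, rule_name):
    """Replace every adjacent pair matching the rule with its fused operator."""
    (first, second), fused = FUSION_RULES[rule_name]
    result, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == first and ops[i + 1] == second:
            result.append(fused)
            i += 2
        else:
            result.append(ops[i])
            i += 1
    return result


def enumerate_candidates(ops):
    """Return the distinct graphs produced by applying each subset of rules."""
    candidates = []
    names = sorted(FUSION_RULES)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            graph = list(ops)
            for rule_name in subset:
                graph = apply_rule(graph, rule_name)
            if graph not in candidates:
                candidates.append(graph)
    return candidates


model = ["Conv2D", "ReLU", "FullyConnected", "ReLU"]
for candidate in enumerate_candidates(model):
    print(candidate)
```

For this four-operator chain the sketch yields four candidates, from the unfused graph to the fully fused one. Which of them is fastest is left to measurement in the next stage.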

3.2 Optimizer

The Optimizer acts as the empirical core of the compiler. Its objective is to map every operator in a candidate graph to the kernel implementation that runs fastest on the target hardware. Unlike traditional frameworks that rely on analytical cost models to estimate performance, our Optimizer is designed around a “hardware-in-the-loop” philosophy: mapping decisions are grounded in actual behavior on the target 5G RAN hardware.

Internally, the Optimizer first classifies operators by their performance characteristics and routes each through the evaluation strategy best suited to it — some operators benefit from empirical measurement against multiple kernel variants, while others can be resolved through lightweight static rules without incurring measurement overhead. The Optimizer then selects and validates the chosen kernels: any candidate graph whose kernels fail to execute correctly is discarded immediately, and the surviving mappings are serialized as the primary input for the Code Generator.
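The select-and-validate step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the kernel variants are plain Python functions standing in for compiled SIMD kernels, and `measure_latency` uses wall-clock timing on the host rather than measurement on the target DU core. All names are hypothetical.

```python
import time


def measure_latency(kernel, args, runs=50):
    """Average wall-clock seconds per call (a stand-in for on-target measurement)."""
    start = time.perf_counter()
    for _ in range(runs):
        kernel(*args)
    return (time.perf_counter() - start) / runs


def select_kernel(variants, args, reference, measure=measure_latency):
    """Return the name of the fastest variant whose output matches the reference.

    Variants that crash or return a wrong result are discarded before timing.
    Returns None when no variant survives validation.
    """
    best_name, best_latency = None, float("inf")
    for name, kernel in variants.items():
        try:
            if kernel(*args) != reference:
                continue  # wrong result: discard
        except Exception:
            continue      # failed to execute: discard
        latency = measure(kernel, args)
        if latency < best_latency:
            best_name, best_latency = name, latency
    return best_name


def dot_loop(a, b):
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_builtin(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_broken(a, b):
    return sum(a)  # deliberately incorrect variant


variants = {"dot_loop": dot_loop, "dot_builtin": dot_builtin, "dot_broken": dot_broken}
args = ([1, 2, 3], [4, 5, 6])
print(select_kernel(variants, args, reference=32))
```

Passing the measurement function as a parameter keeps the selection logic independent of how latency is obtained, so the same loop can be driven by a real on-target benchmark.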

3.3 Code Generator

The Code Generator takes the optimization decisions produced by the Optimizer and turns them into deployable C++ code. The overall flow proceeds in two phases.

In the first phase, the generator traverses the graph and maps each operator to its assigned kernel implementation, resolving the necessary context for code emission — model weights, intermediate tensor declarations, and the connective logic between successive kernels. In the second phase, it synthesizes the final source, combining parameter loading code with the inference pipeline into a single, self-contained C++ file. By automating these phases end to end, the Code Generator delivers optimized inference code immediately whenever the input model or kernel set changes, eliminating any need for manual code modification.
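A heavily simplified view of these two phases is sketched below in Python. The kernel names (`fc_avx512`, `relu_avx512`), the `kernels.h` header, and the buffer naming are hypothetical, and weight loading is omitted for brevity.

```python
# Hypothetical mapping from operator type to a pre-implemented kernel name.
KERNEL_CALLS = {
    "FullyConnected": "fc_avx512",
    "ReLU": "relu_avx512",
}


def generate_source(ops, model_name="model"):
    """Emit a C++ translation unit that chains one kernel call per operator.

    ops is a list of (operator_type, output_element_count) pairs.
    """
    lines = ['#include "kernels.h"  // hypothetical kernel header', ""]
    lines.append(f"void {model_name}_infer(const float* input, float* output) {{")
    src = "input"
    for index, (op_type, out_size) in enumerate(ops):
        last = index == len(ops) - 1
        dst = "output" if last else f"buf{index}"
        if not last:
            # Intermediate tensors are declared with 64-byte alignment.
            lines.append(f"    alignas(64) static float {dst}[{out_size}];")
        lines.append(f"    {KERNEL_CALLS[op_type]}({src}, {dst});")
        src = dst
    lines.append("}")
    return "\n".join(lines)


source = generate_source([("FullyConnected", 64), ("ReLU", 64)])
print(source)
```

Note that an operator type missing from `KERNEL_CALLS` raises a `KeyError` in this sketch, which mirrors the kernel-library dependence discussed in Section 4.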

4. Achievements and Limitations of the Module-Based Compiler

Figure 3. The module-based compiler reaches latency parity with expert hand-tuned implementations, far outperforming general-purpose compilers (TVM, OpenVINO) while reducing the time to produce a deployable C++ kernel from days of manual work to a single invocation.

Before turning to its limitations, it is worth taking stock of what the module-based compiler actually delivered. The most immediate gain was development velocity. Where the previous workflow — analyzing target hardware by hand, drafting an optimization strategy, and writing AVX-512 C++ kernels line by line — could occupy a senior engineer for days or weeks per model, the compiler reduced that turnaround to a single automated invocation. A new model checkpoint could be passed in, and a deployable C++ source bundle would emerge with no human intervention in between.

Equally important, the quality of the output held up. We measured the average inference latency on a single core of a server-class Intel CPU running at a fixed frequency. As shown in Fig. 3, the module-based compiler reaches latency parity with carefully hand-tuned baselines on small DNN-based 5G RAN AI models. The contrast with general-purpose AI compilers was even sharper: when the same models were executed under TVM or OpenVINO, neither could reach the latency envelope demanded by an in-line RAN component, falling short by more than an order of magnitude. Their generated code, optimized under assumptions of a standalone inference process and loose latency targets, simply did not fit the tight budget of a 5G slot. By emitting self-contained AVX-512 kernels with no runtime, no scheduler, and no engine to coexist with, our module-based compiler was the only configuration in our evaluation that produced AI inference fast enough to be deployable inside a real-time RAN signal processing pipeline.

These gains, however, came with a structural ceiling that became increasingly visible as we attempted to scale the compiler to more models and more hardware variants. The most pressing limitation was the dependence on the pre-implemented kernel library. Although the compiler chose between candidate kernels intelligently, every kernel still had to be written by hand. Adding support for a new operator type — a layer normalization variant, an attention primitive, an unfamiliar activation — required an expert engineer to design, implement, vectorize, and validate a new SIMD kernel before the compiler could even consider the operator. If a model arrived containing an operator for which no implementation existed, the compiler simply could not produce output. The automation we had achieved was conditional on a manually maintained library, and that library scaled linearly with engineering effort.

A second limitation lay in the rule-based graph optimization itself. The fusion rules captured the optimization patterns we had already understood, but they captured nothing else. Modern AI architectures, particularly those customized for specific 5G RAN tasks, frequently introduce structural patterns that fall outside any pre-enumerated rule: unconventional skip connections, operator orderings that hint at fusable subgraphs without matching a known template, or layer compositions where the optimal SIMD strategy is not a fusion at all but a re-tiling or a memory-layout transformation. For such graphs the compiler would still produce correct, deployable code, but the resulting inference latency was almost certainly suboptimal — the compiler had no way to recognize the optimization opportunity, let alone act on it.

Taken together, these limitations pointed in a single direction. What the compiler lacked was not more rules or more kernels, but the ability to reason about novel operators and unfamiliar graph structures the way a human expert does: by drawing on broad knowledge, recognizing patterns by analogy, and proposing optimization strategies that no rule had been written for. This is precisely the kind of open-ended inference that recent advances in large language models have begun to make tractable as a software component. Section 5 describes the multi-agent compiler that emerged from this observation.

5. Second Generation: The Multi-Agent AI Inference Optimization Compiler

Closing the gap identified in Section 4 is not a matter of building a larger rule table or a richer kernel library — both approaches simply postpone the same scaling problem. What is needed is a different decision-making substrate, one that can interpret an unfamiliar operator from context, propose an optimization strategy by analogy with prior cases, and write the corresponding SIMD C++ code on the spot. These are the capabilities that modern large language models have begun to exhibit in software-engineering domains [9].

Even within the rule space the module-based compiler did cover, its decisions were locally optimal but globally rigid; an LLM-based reasoning component, by contrast, can weigh the broader graph context and recognize when a particular sequence would benefit from being split rather than fused given its tensor shapes and the target cache layout. Combining these two needs — open-ended operator handling and context-sensitive optimization — into a single monolithic LLM call is impractical. The natural design response is to decompose the task across multiple cooperating agents [10], orchestrated as a pipeline that mirrors the compilation flow.

5.1 A Feasibility Study: Can an LLM Write Production-Quality SIMD Kernels?

Before committing engineering effort to a full multi-agent architecture, we needed to answer a more fundamental question: could an LLM, given an appropriate description of the task, actually produce SIMD C++ kernel code competitive with that written by an expert by hand? Low-level intrinsic programming has long been considered one of the last domains where human expertise was indispensable. If the answer were no, no amount of orchestration logic could compensate, and the multi-agent direction would be a dead end.

The experiment compared LLM-generated SIMD kernels against the hand-written kernels already used by the module-based compiler, on two operators that exercise very different optimization patterns: Conv2D, which is dominated by structured data reuse, and Fully Connected, whose performance hinges on dense matrix-multiplication efficiency. We varied the amount of context supplied to the model across four prompt levels. Level 1 provided only the operator name and SIMD support. Level 2 added input/output shape, data type, and AI model architecture. Level 3 further included optimization strategy hints such as preferred loop tiling dimensions, register blocking, and memory alignment requirements. Level 4 supplied the full set of hardware information available to a human expert, including expected register assignment strategy, FMA pipeline utilization, and cache hierarchy details. We measured all kernels on the same isolated single core of a server-class Intel CPU running at a fixed frequency.
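The cumulative structure of the four prompt levels can be sketched as follows. This Python example only illustrates the layering; the field names and the sample values are hypothetical and do not reproduce the prompts used in the experiment.

```python
# Fields added at each prompt level; each level includes all lower levels.
PROMPT_LEVELS = {
    1: ["operator", "simd"],
    2: ["input_shape", "output_shape", "dtype", "architecture"],
    3: ["tiling_hint", "register_blocking", "alignment"],
    4: ["register_assignment", "fma_pipeline", "cache_hierarchy"],
}


def build_prompt(context, level):
    """Concatenate the context fields of levels 1..level as 'name: value' lines."""
    if level not in PROMPT_LEVELS:
        raise ValueError(f"unknown prompt level: {level}")
    lines = []
    for current in range(1, level + 1):
        for field in PROMPT_LEVELS[current]:
            if field in context:
                lines.append(f"{field}: {context[field]}")
    return "\n".join(lines)


# Hypothetical context values, for illustration only.
context = {
    "operator": "Conv2D",
    "simd": "AVX-512",
    "input_shape": "1x16x32x32",
    "output_shape": "1x32x32x32",
    "dtype": "float32",
    "architecture": "small CNN",
    "tiling_hint": "tile output channels by 16",
    "register_blocking": "4 accumulators",
    "alignment": "64-byte",
    "register_assignment": "keep accumulators in zmm registers",
    "fma_pipeline": "two FMA ports",
    "cache_hierarchy": "L1/L2 sizes of the target CPU",
}

print(build_prompt(context, 1))
```

Building the prompt from a shared context dictionary means that only the amount of information changes between levels, not its wording.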

Figure 4. Across both operators, kernels generated at higher prompt levels approach or surpass the hand-tuned baseline, confirming that an LLM supplied with structured hardware and shape context can produce expert-quality SIMD code.

Fig. 4 summarizes the results. Across both operators and across the LLM models we tested, kernels generated at the higher prompt levels consistently surpassed the hand-written baselines. The improvement curve from Level 1 to Level 4 was almost monotonic, confirming that LLM-generated low-level code is highly sensitive to the precision of the contextual information it is given — a result with direct implications for how a multi-agent compiler should structure information flow into its codegen agents. The Level 4 kernels in particular, which received the same kind of structured hardware context that an expert engineer would assemble before writing SIMD code, demonstrated that the LLM was applying the supplied constraints to produce kernels tuned to the specific deployment scenario rather than merely pattern-matching against memorized snippets.

5.2 Architecture and Methodology of the Multi-Agent Compiler

The multi-agent compiler reconstitutes the work that the module-based system performed inside fixed Python modules into a collaboration of specialized reasoning agents, organized into functional groups and coordinated by a single Orchestrator. Each agent inherits a narrow responsibility within its group, but executes it through LLM reasoning grounded in structured context. To our knowledge, no prior AI compiler has been built around this design: not as a wrapper over an LLM that emits code, but as an end-to-end compilation pipeline whose every analysis, optimization, and code-generation decision is delegated to a reasoning agent. Table 1 summarizes the agent organization, and Fig. 5 shows the corresponding pipeline structure with its feedback loop.

Table 1. Functional Groups of the Multi-Agent AI Inference Optimization Compiler

Three methodological principles govern how AI is used inside this system, and together they distinguish it from a naive “compiler-as-prompt” approach. The first principle is strict task scoping. Each agent is given a tightly-bounded prompt template that defines what it may decide, what it must output, and — equally importantly — what it must not attempt. This containment prevents the cascade of silent re-interpretations that occurs when an LLM is given latitude beyond its station, and makes each agent’s behavior independently testable.

The second principle is the delegation of deterministic work to Python tooling. A surprising amount of what looks like reasoning in a compiler is, on inspection, mechanical: parsing ONNX graphs, extracting tensor shapes, comparing numerical outputs, measuring elapsed time. Tasking an LLM with these operations introduces randomness with no upside. We therefore identified every deterministic sub-task and implemented it as a Python script that the relevant agent invokes as a tool. The LLM is invoked only where genuine judgment is required: choosing a fusion strategy, drafting a kernel, diagnosing a bottleneck. This separation of judgment from computation is the single most important factor in keeping a multi-agent system reliable enough for production use.
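A minimal sketch of this separation is shown below, assuming a simple name-based tool registry. The tool names and the dispatch function are hypothetical; the point is that the numerical comparison and shape arithmetic run as ordinary deterministic Python rather than as LLM output.

```python
import math


def outputs_match(candidate, reference, rel_tol=1e-5, abs_tol=1e-6):
    """Element-wise comparison of a generated kernel's output against a reference."""
    if len(candidate) != len(reference):
        return False
    return all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )


def tensor_size(shape):
    """Number of elements in a tensor of the given shape."""
    return math.prod(shape)


# Registry of deterministic tools that an agent may request by name.
TOOLS = {"outputs_match": outputs_match, "tensor_size": tensor_size}


def call_tool(name, *args, **kwargs):
    """Dispatch a tool request to plain Python; unknown tools are rejected."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*args, **kwargs)


print(call_tool("tensor_size", (1, 3, 224, 224)))
print(call_tool("outputs_match", [1.0, 2.0], [1.0, 2.0000001]))
```

Because each tool is a pure function, the same inputs always give the same verdict, and each tool can be unit-tested independently of any agent.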

The third principle is a retrieval-augmented memory of past compilations [11], complemented by a feedback loop that lets the system recover from suboptimal first attempts. Successful kernels and end-to-end compilations are written to a repository indexed along the dimensions that characterize a compilation request. When a new compilation begins, the corresponding script retrieves matching entries and supplies them as in-context examples to the strategy and code-generation groups. Retrieval is only one half of the story. Once the Verification group detects that the latency budget has been missed, the Orchestrator routes a structured diagnostic back to the Strategy group, which re-plans the optimization strategy with this new information in hand. Together, the repository and the feedback loop turn the compiler into a system that improves both across invocations (through accumulated experience) and within a single invocation (through measurement-driven re-planning) — a property that no rule-based or auto-tuning compiler can claim.
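The interplay of repository and feedback loop can be sketched in Python as follows. This is a simplified illustration: the exact-match key lookup, the stub `plan`, `generate`, and `measure` callables, and the latency numbers are all assumptions made for the example, standing in for the Strategy, Code Generation, and Verification groups.

```python
class CompilationRepository:
    """Stores the best known kernel per request key (exact-match lookup)."""

    def __init__(self):
        self._entries = {}

    def store(self, key, kernel_source, latency_us):
        best = self._entries.get(key)
        if best is None or latency_us < best[1]:
            self._entries[key] = (kernel_source, latency_us)

    def retrieve(self, key):
        return self._entries.get(key)


def compile_with_feedback(key, plan, generate, measure, budget_us, repo, max_attempts=3):
    """Plan, generate, and measure; re-plan with a diagnostic when the budget is missed.

    Returns (source, attempts_used), or (None, max_attempts) if no attempt met the budget.
    """
    diagnostic = None
    for attempt in range(1, max_attempts + 1):
        strategy = plan(repo.retrieve(key), diagnostic)
        source = generate(strategy)
        latency = measure(source)
        if latency <= budget_us:
            repo.store(key, source, latency)  # successful result becomes future context
            return source, attempt
        diagnostic = {"attempt": attempt, "latency_us": latency, "budget_us": budget_us}
    return None, max_attempts


# Stubs standing in for the agent groups; the latencies are made-up numbers.
def plan(examples, diagnostic):
    return "aggressive" if diagnostic is None else "conservative"


def generate(strategy):
    return f"// kernel built with {strategy} strategy"


def measure(source):
    return 120.0 if "aggressive" in source else 80.0


repo = CompilationRepository()
key = ("Conv2D", (1, 16, 32, 32), "avx512")
result, attempts = compile_with_feedback(key, plan, generate, measure, 100.0, repo)
print(attempts, result)
```

In this run the first plan misses the assumed 100 microsecond budget, the diagnostic triggers a re-plan, and the second attempt is stored in the repository for later requests with the same key.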

Figure 5. Architecture of the Multi-Agent AI Inference Optimization Compiler — Sequential execution of nine agents under a single Orchestrator, with a feedback loop on missed latency budgets.

5.3 Results and Discussion

We evaluated the multi-agent compiler on a representative collection of AI models drawn from 5G RAN signal-processing tasks, spanning three broad architectural families: compact DNNs, convolutional networks, and Transformer variants. For each model, we measured end-to-end inference latency under two configurations: expert-written hand-tuned C++ code and code generated by the multi-agent compiler. All measurements were performed on the same single core of the target server-class Intel CPU running at a fixed frequency, and each result is averaged over 10,000 runs.

Fig. 6 summarizes the results, aggregated per architectural family. In every category, the multi-agent compiler produced lower average inference latency than the hand-tuned baseline. The fact that this advantage holds across architectural families spanning several orders of magnitude in workload size — without any per-family tuning — is itself a notable result: it shows that the compiler’s advantage is not specific to a particular workload class but generalizes across the diversity of models actually encountered in 5G RAN deployment.

Three factors, in our analysis, account for this outcome. The first factor concerns the scope of optimization reasoning. Hand-tuned code and the module-based compiler both ultimately rely on humans to look at a model and decide how to optimize it; for complex graph structures, the cognitive overhead of holding the entire computation graph in mind quickly exceeds what an engineer can reliably manage. The LLM-driven Strategy group can ingest the full graph as a single prompt, and reason about global optimization opportunities — cross-layer fusions, layout transformations spanning multiple operators, redundant data movement at non-adjacent nodes — that a human optimizer would miss not for lack of skill but for lack of bandwidth.

The second factor is the adaptability of code generation. The hand-tuned and module-based approaches are fundamentally constrained by the kernel library available to them. The multi-agent compiler suffers no such ceiling: the Code Generation group produces an AVX-512 SIMD kernel tailored to the exact operator, shape, and target microarchitecture of the request at hand, every time. The library is no longer a library but an open-ended generation capability — a structural difference that compounds as the diversity of input models grows.

Figure 6. The multi-agent compiler outperforms hand-tuned baselines across all architectural families.

The third factor is the cumulative effect of the methodological design choices. Strict task scoping prevents any single agent from drifting outside its competence. The delegation of deterministic operations to Python tools removes a class of failure modes that would otherwise erode reliability. The feedback loop gives the compiler the freedom to take an aggressive optimization plan on its first pass and recover gracefully when the resulting latency falls short — something a one-shot pipeline simply cannot do. And the RAG-style repository ensures that every successful compilation contributes to the system’s future performance. None of these mechanisms is individually exotic, but their combination is what made low-level SIMD code generation by an LLM consistently better than expert work.

6. Conclusion

This post has traced two generations of an automatic AI inference optimization compiler designed to meet the microsecond-level latency demands of 5G RAN signal processing. The first generation, a deterministic module-based compiler structured as Model Parser, Optimizer, and Code Generator, automated the manual optimization workflow into a reproducible three-stage pipeline producing self-contained AVX-512 SIMD C++ kernels. The second generation, an LLM-driven multi-agent compiler, replaced the rule-based decision substrate with nine cooperating reasoning agents coordinated by an Orchestrator. It is governed by three methodological principles (strict task scoping, delegation of deterministic work to Python tooling, and a RAG-style repository) and equipped with a feedback loop that re-plans optimization strategies when the latency budget is missed. Across our evaluation set, the multi-agent compiler delivered inference that was on average 46.4% faster than expert hand-tuned implementations.

What this work reaffirms is that bringing AI into the RAN is not solved by a well-trained model alone: equally critical is whether that model can run fast enough to live inside the real-time pipeline it was meant to serve. Our compiler addresses precisely this half of the problem, turning what was once a multi-week manual engineering effort into a single automated invocation that produces deployment-ready AVX-512 SIMD C++ code. Within Samsung Research’s AI-native RAN roadmap, this compiler occupies the deployment-enabling layer that lets every AI model actually reach the inference budget the radio interface demands. As the diversity and sophistication of RAN AI workloads continue to grow, we expect the same architecture to keep pace with them.

References

[1] C.-X. Wang, M. Di Renzo, S. Stanczak, S. Wang, and E. G. Larsson, “Artificial Intelligence Enabled Wireless Networking for 5G and Beyond: Recent Advances and Future Challenges,” IEEE Wireless Communications, vol. 27, no. 1, pp. 16–23, Feb. 2020.
[2] 3GPP, “Study on Artificial Intelligence (AI)/Machine Learning (ML) for NR air interface,” 3rd Generation Partnership Project (3GPP), Technical Report (TR) 38.843, Release 18, 2024.
[3] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,” in Proc. 13th USENIX Symp. Operating Systems Design and Implementation (OSDI), Carlsbad, CA, USA, Oct. 2018, pp. 578–594.
[4] Intel Corporation, “OpenVINO Toolkit: Open-source toolkit for optimizing and deploying AI inference,” 2018. [Online]. Available: https://docs.openvino.ai/
[5] Intel Corporation, “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference,” Order Number 319433-061, Mar. 2026.
[6] ONNX Project Contributors, “ONNX: Open Neural Network Exchange,” GitHub Repository, 2017. [Online]. Available: https://github.com/onnx/onnx
[7] D. Jin, “onnx-simplifier: Simplify your ONNX model,” GitHub Repository, 2019. [Online]. Available: https://github.com/daquexian/onnx-simplifier
[8] Google, “Protocol Buffers: A language-neutral, platform-neutral extensible mechanism for serializing structured data,” 2008. [Online]. Available: https://protobuf.dev/
[9] A. Fan et al., “Large Language Models for Software Engineering: Survey and Open Problems,” in Proc. 2023 IEEE/ACM Int. Conf. Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 2023, pp. 31–53.
[10] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. Awadallah, R. White, D. Burger, and C. Wang, “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,” in Proc. Conf. on Language Modeling (COLM), 2024.
[11] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474.

AI in 6G network: Service and System Aspect

Tingyu Xin|Hyesung Kim

3GPP has been a pioneer in integrating and supporting AI in the network since the early stages of the 5G era. As the 5G specifications evolved, driven by the various use cases identified by the SA1 Working Group (WG), the 3GPP Technical Specification Groups, including Radio Access Networks (RAN), Services & Systems Aspects (SA) and Core Network & Terminals (CT), have been actively enabling AI-related features across different standardization domains, aiming to improve network performance and thereby enhance user experience. The RAN Working Groups focus on leveraging AI/ML to optimize the performance of the NR air interface and the base station [1][2][3]. Meanwhile, the 3GPP SA2 WG has been pivotal in leveraging AI to enable 5G Core Network automation; beyond this, SA2 also aims to provide assistance for application-layer AI/ML-based services. Additionally, SA2 supports data collection for AI-based positioning to increase positioning accuracy and facilitate AI operation across the standardization domains of the SA and RAN WGs [4], [5]. Cross-domain AI operation involves the User Equipment (UE), the base station, the core network, and the application function (part of the application server that supports interaction with the core network). SA3 is in charge of AI/ML security-related aspects. The SA5 and SA6 WGs focus on AI/ML for network management and on supporting AI/ML services at the application layer, respectively. The CT WGs are responsible for specifying the protocols that fully enable AI/ML features over different interfaces, such as those between the UE and the network, as well as between Core Network Functions.

Looking ahead to the 6G era, 3GPP will further leverage cutting-edge AI technologies to design an AI-native and AI-friendly wireless communication system, significantly enhancing network performance and user experience while supporting a wide array of new use cases. This article focuses primarily on AI in the Core Network within the scope of 3GPP SA2. By delving into the key enablers of 6G AI, based on the objectives outlined in the 6G Work Item Description (WID) [6] and documented in the 6G Technical Report (TR) [6], this article aims to provide a comprehensive understanding of the future direction of AI in 3GPP Core Networks.

Review and Limitations of 5G Standardisation Related to AI

5G Core Network Automation

SA2 initiated standardization work on core network automation using AI/ML in Release 15 and has continued to deploy AI/ML techniques to enhance core network intelligence throughout the later 5G releases.

The specified core network automation features aim to provide various network data analytics to their service consumers: Core Network Functions (NFs), OAM and even Application Functions (AFs). Consumers can use the analytics information to decide on and optimise their operations, and thereby enhance network performance and customer experience. SA2 specified a dedicated NF, the Network Data Analytics Function (NWDAF), which produces analytics as output that can be exposed to service consumers. The 5G core network supports diverse use cases, for example enhanced QoS determination and QoE improvement, insight into UE mobility and behaviour, and network signalling storms, among others. The corresponding analytics services are identified by Analytics IDs; the NWDAF supports 23 analytics services in Release 19 [4].

The analytics outputs, in the form of statistics and/or predictions, are derived by the NWDAF by performing inference with trained ML models. The NWDAF trains these models on specified types of data collected from different data sources, for example other NFs, OAM and AFs, either directly or via the DCCF (Data Collection Coordination Function). To ensure the implementability, interoperability and flexibility of Core Network automation, ML models can be shared between NWDAFs from different vendors.

To protect user and network data privacy, and to reduce the overhead and load of AI/ML operation, Federated Learning (FL) was introduced to the 5GC. FL protects user and network privacy by keeping raw data on local entities instead of sending it from different entities in different areas to a central server. Given the advantages of FL for communication systems, Release 18 added support for Horizontal Federated Learning (HFL) for ML model training and inference among multiple NWDAFs on a per Analytics ID basis. To further protect sensitive network and user information and enhance data efficiency, Vertical Federated Learning (VFL) was standardised in Release 19 to support more efficient ML operation involving NWDAF(s) and AF for different Analytics ID(s).
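The idea of keeping raw data local while sharing only model parameters can be illustrated with a short Python sketch of sample-weighted parameter averaging (often called federated averaging). This is an assumption chosen for illustration: the text above does not state which aggregation rule is used among NWDAFs, and the function name and the two-client example are hypothetical.

```python
def federated_average(client_updates):
    """Combine locally trained weights into one global weight vector.

    client_updates is a list of (weights, num_samples) pairs. Only these
    parameters leave each client; the raw training data stays local.
    """
    total = sum(num_samples for _, num_samples in client_updates)
    if total == 0:
        raise ValueError("no training samples reported")
    size = len(client_updates[0][0])
    averaged = [0.0] * size
    for weights, num_samples in client_updates:
        if len(weights) != size:
            raise ValueError("clients must share the same model shape")
        for i, weight in enumerate(weights):
            averaged[i] += weight * num_samples / total
    return averaged


# Two hypothetical clients that trained the same two-parameter model.
updates = [([1.0, 2.0], 1), ([3.0, 4.0], 3)]
print(federated_average(updates))
```

The requirement that all clients share the same model shape corresponds to the horizontal setting, in which participants hold different samples described by the same features.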

Other Aspects of AI in 5G Core Networks

• 5G Core Assistance to Application-Layer AI/ML Service

The fast development of AI/ML services has led to a significant increase in the number of AI/ML-based services delivered to users via wireless communication networks. This evolution has introduced new commercial opportunities for Mobile Network Operators (MNOs), as the core network plays an important role in providing assistance to application-layer AI/ML services. By assisting these services, MNOs can optimise network resource scheduling and configuration for the massive and frequent transfer of AI/ML-related service data, while ensuring a high-quality user experience.

To enable the Core Network’s assistance to AI/ML-based application functions, SA2 worked during Release 18 on 5G System architectural and functional extensions that facilitate more efficient monitoring of network resource utilization and expose the relevant information to AFs. Additionally, standardized 5QI to QoS characteristics mappings were specified for the AI/ML-based service data traffic. For instance, 5QI 6 can be used for AI/ML model download for image recognition. Furthermore, the Core Network provides APIs externally so that the AFs can provision AI/ML service-related requirements (e.g., time window for the AI/ML service transfer and the required QoS) to the Core Network. As a result, the 5GC can preconfigure resources for the corresponding AI/ML services through negotiation with the AF, so that the application-layer AI/ML services are delivered with guaranteed service quality.

• Cross-domain AI Operation in 5GS

To explore full support of AI/ML for NR air interface in 5GS, close collaboration between SA and RAN WGs was pursued during Release 19 and 20. The major purpose of this work is to facilitate frequent, high-volume transfer of AI model training data while ensuring that MNOs can potentially retain visibility and controllability of AI/ML-related data. In Release 19, SA2 specified architectural impacts and procedures for data collection from UE and gNB to the location management function (LMF), located in the Core Network, to support AI/ML-based positioning (for LMF-side model and gNB-side model cases). In Release 20, SA2 has conducted a study on transferring AI model training data collected by the UE to an AI model training server (inside or outside the 3GPP system) via the user plane (UP) to support UE-side model training. However, the fundamental design of the 5G UP, which is intended for transferring user data from the UE to the Data Network (DN) with the data load remaining transparent to the Core Network, poses significant technical challenges. Another significant challenge is the lack of a unified data collection and transfer mechanism among different domains within 3GPP. These inherent limitations of the 5GS architecture hindered the realization of cross-domain AI data transfer and cross-domain AI operation in the 5G era. Note that SA2 has studied potential mechanisms for data collection over UP, but decided not to pursue this normative work. More flexible and unified solutions are expected in 6G to support more comprehensive AI operation across different 3GPP WGs.

• Limitation of AI in 5G Core Network

In the current 5GC architecture, AI-based network automation is only supported in a centralised manner by NWDAF. Although the LMF has had the capability for AI-based positioning since Release 19, its application is limited to specific positioning scenarios.

The NWDAF-centralized approach of 5G Core network AI design also introduces certain limitations in terms of data collection, network flexibility and scalability. For instance, input data for ML model training has to be transferred from different NFs as data sources to NWDAF over the control plane, which results in a significant load for data collection. Furthermore, from a standardisation perspective, the NWDAF is an optional NF, so the AI capability in the 5G Core Network is an optional feature rather than a fundamental part of the design. The 5G Core Network Architecture is shown in Figure 1.

The 5G Core network automation framework primarily focuses on NWDAF performing ML model training for given input data and output analytics. However, the framework is narrowly tailored to the NWDAF-centric workflow, limiting its applicability as a general automation framework. Within this approach, introducing new AI-empowered NFs or AI algorithms will involve significant standardisation work, which limits the development or adaptation of the Core network to more diverse use cases. More importantly, the current design of 5G AI lacks extensibility to support the AI operations and use cases that require collaboration across different 3GPP WGs, in particular between Core Network and RAN WGs, often referred to as cross-domain AI operation.

To address these limitations and support more flexible and scalable AI operation within the 3GPP network, the 6G era is expected to introduce an AI-decentralised and AI-friendly network. This approach will provide native support for different AI technologies and AI operations. A unified data framework across the 3GPP system will also be introduced to facilitate comprehensive and flexible data transfer among users, networks and application servers. By leveraging advanced AI technologies and unified data frameworks, 6G networks will move beyond the traditional role of transferring data, evolving into active enablers of next-generation services. This transformation will allow the Core Network for 6G to play a more integral role in delivering value-added services, optimizing resource utilization, improving network management, and enhancing user experiences.

Figure 1. 5G System Architecture with centralised AI design

Introduction of 6G AI

As more AI services emerge, the AI-related service data being transmitted through communication systems is increasing rapidly. At the same time, AI is now being widely adopted in various industries, transforming workflows and improving system efficiency. This trend has made AI a key focus in 6G discussions from the very beginning, with support and interest from mobile network operators, mobile vendors and network vendors. To address this, 3GPP SA2 WG has agreed to study AI in the Core Network as a major key issue starting from 6G day 1. The detailed key issue description is documented in 3GPP TR 23.801-01 [7]. Building on the limitations of 5G highlighted earlier, this section explores the potential key enablers of 6G AI that can not only overcome the existing challenges but also meet new requirements for 6G.

AI for 6G Core Network

• AI-Enabled Core Network Functions

One of the fundamental enablers for the 6G Core Network is support for AI-powered core Network Functions (NFs). In 5G, the NWDAF is the main NF supporting network automation; the analytics that NWDAF provides through ML inference are mostly used only as assistance information by the consumer NFs for decision-making. For instance, the Policy Control Function (PCF) determines QoS parameters based on the NWDAF analytics of network conditions, service requirements, user status and plenty of other information such as QoS monitoring. However, in such a complex procedure, multiple entities (including different NFs, base stations, AFs and UEs) are all involved. Each entity uses its own information and internal logic to decide on the standardised actions based on different triggers and interactions with others, which can lead to inefficiencies. To address this, 6G aims to support AI-enabled network functions that allow each NF to make independent decisions using AI, considering complex and comprehensive information. Additionally, the AI-enabled NFs can collaborate or federate with other entities in a distributed manner to solve complex tasks.

Figure 2. Distributed AI design in 6G System

AI-powered network functions are the fundamental enabler for supporting an AI-native network and providing AI as a service. They will be capable of performing AI operations such as AI model training, inference, performance monitoring and model update, and even model provisioning to other entities. Unlike the centralised AI design of 5G, the distributed AI in the 6G Core Network reduces the need for massive data collection and high computing power at a single NF. The AI-powered 6G network functions can leverage local data, requiring only minimal data from other sources, to make AI-based decisions locally and in real time.

The AI-powered NFs can also collaborate with other entities to solve complex tasks within the Core Network or across different domains. For instance, for QoS determination, instead of requiring extensive information from the multiple involved entities, AI-powered NFs (e.g. PCF, SMF, UPF, etc.), RAN nodes and OAM can work in conjunction to train a joint AI model and perform joint inference. To enable this type of AI-native operation, enhancements to the network architecture are necessary in 6G, such as standardised interaction between AI-enabled entities, including interfaces and the information to be transferred. Existing interfaces, like SBI or NGAP, could potentially support control information exchange. Potential enhancements to the interfaces, and whether new AI-related protocols are required, will be further studied by SA2. The enhancements will also facilitate efficient data collection and transfer for AI operations across different domains, as well as seamless interaction between AI-powered functions and non-AI entities. The exchange of contextual information to optimize AI-driven decisions should also be supported to ensure system performance.

The distributed AI-powered NF design in 6G Core Network will enable a flexible way of working between AI and non-AI entities, optimizing network performance and delivering high-quality services to customers. At the same time, the AI capabilities of the network will also allow it to provide AI as a service to its consumers.

• Agentic Core Network

In 3GPP SA2, one potential way of utilizing AI technology for the network is to introduce AI agents as part of the 6G core network. In AI academia, ‘agentic’ generally refers to autonomy, planning, and decision-making capabilities used to achieve complex goals with minimal or no human guidance. The term AI agent in the Core Network refers to an entity that autonomously performs tasks on behalf of UEs, systems and/or applications. There is growing interest in integrating advanced AI technologies, such as AI agents, to enable automated core network operations by comprehending the varied and dynamic requirements of subscribers, as well as considering situational factors like network congestion status. Moreover, each subscriber’s UE will be in a different context with respect to, e.g., the application service in use, mobility status, and subscribed charging plan, which makes it necessary to satisfy subscriber-customized service requirements.

The 5G core network has limitations in addressing the aforementioned subscriber-customized service requirements. It processes service requests based on predefined or semi-dynamically generated rules, with operational flows between core network components executed sequentially. In 6G, given that the service requirements of each subscriber will be more diverse, an operation mode that adheres to inflexible rules becomes limiting when a subscriber's service needs change with its context. Additionally, it is anticipated that not only traditional connectivity services but also a broad range of services beyond connectivity that leverage network infrastructure, such as sensing or computing services, can be provided. The requests are not limited to the 3GPP-specified requests of 5G and earlier releases, e.g. a registration request or PDU session request; a request may also include more general information or user intent, for instance that the user would like to play a VR game with a high expected QoE (Quality of Experience). The complexity of network control, driven by the delivery of diverse services, is expected to increase. This has heightened interest in leveraging AI technologies to address this challenge effectively.

Based on the above discussion, SA2 has agreed to study the following major aspects on enabling 6G core network to leverage AI capabilities:

How to fulfil requests from a UE or application function, noting that the requests may be in the abstracted form of an intent, which refers to expectations including requirements, goals, conditions, guidelines, and constraints without specifying how to achieve them. This intent-based request from a UE or AF will be defined as part of this study. The SA2 6G study assumes that the modem stack of the UE is not required to produce or understand the intent, and that it is agnostic to whether the 6G network uses AI-capable entities to address UE requests. Additionally, constraints on the use and expression of intent-based requests will also be discussed to avoid ambiguous processing and interpretation of intent in the 6G core network.

How to enable AI-capable entities in the 6G core network to dynamically compose parts of procedures to fulfil requests from UEs or AFs, with modularization of the system procedures over the core network. Overall design principles and constraints for modularizing and composing the procedures will be discussed to guarantee stable and reliable core network operation when AI technology is used.

For the aforementioned study in SA2, it has been established as a requirement that the AI for the 6G architecture shall be multi-vendor interoperable, reliable, and sustainable.
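To make the two study aspects more concrete, the following toy sketch shows an intent that states a goal and constraints without saying how to achieve them, and a composer that assembles a procedure from predefined modules. Everything here is a hypothetical illustration (the `Intent` structure, the module names and the allow-list of plans); none of it has been defined by SA2.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Context = Dict[str, object]


@dataclass
class Intent:
    """Hypothetical intent: what is expected, not how to achieve it."""
    goal: str
    constraints: Dict[str, float] = field(default_factory=dict)


# Hypothetical procedure modules; each one reads and extends a session context.
def authenticate(ctx: Context) -> Context:
    ctx["authenticated"] = True
    return ctx


def establish_session(ctx: Context) -> Context:
    if not ctx.get("authenticated"):
        raise RuntimeError("session requires prior authentication")
    ctx["session"] = "established"
    return ctx


def apply_low_latency_qos(ctx: Context) -> Context:
    ctx["qos"] = "low-latency"
    return ctx


MODULES: Dict[str, Callable[[Context], Context]] = {
    "authenticate": authenticate,
    "establish_session": establish_session,
    "apply_low_latency_qos": apply_low_latency_qos,
}

# Assumed design constraint: only allow-listed compositions may be executed,
# which keeps the composed behaviour predictable.
ALLOWED_PLANS: Dict[str, List[str]] = {
    "basic_connectivity": ["authenticate", "establish_session"],
    "vr_gaming": ["authenticate", "establish_session", "apply_low_latency_qos"],
}


def compose(intent: Intent) -> List[Callable[[Context], Context]]:
    """Map an intent to an ordered list of procedure modules."""
    if intent.goal not in ALLOWED_PLANS:
        raise ValueError(f"unsupported intent goal: {intent.goal}")
    return [MODULES[name] for name in ALLOWED_PLANS[intent.goal]]


def fulfil(intent: Intent) -> Context:
    """Run the composed modules in order and return the resulting context."""
    ctx: Context = {"goal": intent.goal, **intent.constraints}
    for step in compose(intent):
        ctx = step(ctx)
    return ctx


result = fulfil(Intent(goal="vr_gaming", constraints={"max_latency_ms": 20.0}))
```

Rejecting goals outside the allow-list is one simple way to express the constraint that intents must not lead to ambiguous processing; an actual design could use very different mechanisms.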

Enabling an agentic core introduces new challenges to 6G standardization. Unlike 5G and earlier generations, where the Core Network operates based on predefined procedures and MNO configurations, ensuring stable operability with autonomous AI agent decisions becomes a critical issue. In this regard, performance monitoring, governance and stability aspects are expected to become more important topics. These can also be entangled with how to support lifecycle management of the agentic core from an OAM perspective. Additionally, modularizing predefined procedures and enabling dynamic composition of these ‘modules’ represents a different logic in specifying 3GPP procedures compared to previous generations. SA2 will assess the feasibility of agentic core network proposals for 6G Day1, incorporating detailed technical inputs from supporting companies during the 6G study phase in 2026.

6G Network for AI Services

• AI Agent-Communication

As more AI-powered equipment, systems and AI-related services emerge, some MNOs have identified new commercial opportunities in supporting more advanced communication scenarios, such as communication and interaction between robots or AI agents on UEs. The MNOs expect to support large-scale, long-distance, flexible and dynamic scenarios for communication between AI agents on UEs. Such communication is not limited to proximity scenarios and requires dynamic control and broader connectivity that cannot be met by ProSe (Proximity-based services) and 5G-LAN (Local Area Network), which were specified in 5G and earlier generations. Furthermore, instead of using the 3GPP system only as a data pipe for data transfer, the 3GPP network can play a much more significant role in controlling, managing and configuring AI agent communication services, leveraging its rich information related to the UEs, the network and the application server.

The key issues to explore in 6G include:

whether and how an AI agent on one UE can discover another AI agent on a different UE through the 6G network.

whether and how to enable communication for AI agents on different UEs via the 6G network(s) e.g., identification and authorization of an AI agent on a UE.

whether and how to enhance network capability exposure functionalities to AI agent on AF(s).

For AI agents on UEs, the goal is to develop robust protocols that facilitate efficient and secure communication between AI agents, leveraging the advanced capabilities of 6G networks. The above key aspects involve understanding the mechanisms for enabling the AI agent discovery based on the required criteria of the discoverer, ensuring identification and authorisation of the discovered AI agent on UEs in various scenarios, and establishing secured communication connections between AI agent on UEs. One of the challenges is to study how to identify and authorize AI agents on UEs. Unlike the typical UEs, the AI agent on UE may have dynamic identification information. Ensuring that only the legitimate agents are allowed to be discovered and connected is critical for network and communication security. Another critical aspect is enabling communication between AI agents on different UEs via the 6G network, for example by allocating resources and determining appropriate policies and configuration for better support of the communication between AI agents on UEs.
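As a rough illustration of criteria-based discovery combined with authorisation, the sketch below models a registry that only returns authorised agents whose capabilities cover what the discoverer asks for. The `AgentRecord` fields and the `AgentRegistry` class are hypothetical; no such network function or information model has been specified.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class AgentRecord:
    """Hypothetical registration entry for an AI agent hosted on a UE."""
    agent_id: str
    ue_id: str
    capabilities: FrozenSet[str]
    authorized: bool


class AgentRegistry:
    """Toy network-side registry used for discovery of AI agents."""

    def __init__(self) -> None:
        self._records: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        # Re-registering replaces the entry, since agent information may change.
        self._records[record.agent_id] = record

    def discover(self, required: Iterable[str]) -> List[AgentRecord]:
        """Return authorised agents offering every required capability."""
        wanted = frozenset(required)
        matches = [
            r for r in self._records.values()
            if r.authorized and wanted <= r.capabilities
        ]
        return sorted(matches, key=lambda r: r.agent_id)


registry = AgentRegistry()
registry.register(AgentRecord("agent-a", "ue-1", frozenset({"navigation", "vision"}), True))
registry.register(AgentRecord("agent-b", "ue-2", frozenset({"vision"}), True))
registry.register(AgentRecord("agent-c", "ue-3", frozenset({"navigation", "vision"}), False))

found = [r.agent_id for r in registry.discover({"navigation", "vision"})]
```

In this example only `agent-a` is returned: `agent-b` lacks a required capability and `agent-c` is not authorised, which mirrors the requirement that only legitimate agents can be discovered.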

Additionally, enhancing network capability exposure functionalities to AI agents on Application Functions (AFs) is another important aspect proposed by some MNOs. This involves studying which network capability information will be the most beneficial for AI agents on AFs and how to effectively expose this information to the AF. It is assumed that the network capability information will potentially help the AF to understand whether the Core Network is capable of providing additional support for AI agent operation, as well as information related to dynamic access and network resource utilisation, operation status, etc. By providing this information to the AF, the AF can optimise and plan its tasks; in turn, the network might be better informed of the AF's decisions, which is helpful for optimising network organisation and efficiency.

• Network for AI Model Training and Inference as Services

Another key focus of 6G is enabling the 6G network to provide AI as a service, particularly AI model training and inference for various application-layer AI use cases. As highlighted earlier, the 6G network will be AI-integrated and natively AI-enabled thanks to the support of AI-enabled NFs, and therefore the fundamental 6G architecture is assumed to be capable of facilitating efficient model training and inference natively. Providing model training or inference as a service to subscribers/users or application servers will enhance the utilisation of 6G network capabilities and also increase the diversity of the services that MNOs can provide.

The growing volume and diversity of AI traffic in the 3GPP network raise the question of whether special handling is required in 6G based on network awareness of AI traffic. The AI traffic characteristics are potentially different from those of most traditional services, and there might be further differences among various kinds of AI traffic. For instance, the requirements for tokenised communication, AI model training and inference traffic can differ. In general, inference traffic might be delay-sensitive, while model training traffic might involve high-volume and frequent transmission. Additionally, different tokens for the same service may have varying error tolerance. However, considering that 5G already supports XR traffic and has specified basic support for application-layer AI services (such as enhanced 5QIs), whether additional special handling is required in the SA2 domain will depend on further identification of the AI traffic characteristics by other 3GPP WGs. This analysis will help SA2 determine whether enhancements are necessary in 6G.

Data Framework for AI Use Cases

Given the high data volumes generally associated with AI services, transferring such data over the existing control plane (CP) could lead to congestion. The current user plane (UP), although capable of handling relatively high data volumes, lacks the data controllability and visibility at the core network that MNOs require. Additionally, the UP mechanism struggles to transfer data from the UE or base station to an NF, which is critical for some use cases such as UE data transfer for UE model training and sensing data transfer from RAN (e.g., RAN node-based sensing).

To address the limitations of 5G that create obstacles for the cross-domain AI standardization and better support the AI enablers highlighted for 6G, it is essential to enable a flexible and unified data transfer framework. This data framework will enable seamless data transfer between entities within 3GPP system and other entities inside or outside 3GPP system. This approach can help resolve existing bottlenecks of cross-domain data transfer in 5G.

The 6G data framework will be a significant topic in the SA2 6G study. While we will not explore the detailed technical aspects of the data framework in this article, it is worth highlighting its potential support for AI-related use cases. The data framework aims to provide flexible collection and transfer of data or related information across different domains, including the core network in the SA2 domain, the physical and radio layers in the RAN1 and RAN2 domains, the network management data in the SA5 domain, etc. By facilitating the streamlined collection, transfer, and sharing of AI-related data, the data framework ensures efficient and secure communication among distributed AI-enabled NFs, AI-powered RAN nodes, different types of user devices, and 3rd party servers. This approach will play a crucial role in enabling AI services in 6G networks.

Summary, Challenges and Future Work

This article delineates the key enablers of AI in 6G Core Network that address the highlighted 5G limitations while paving the way for more efficient and intelligent communication systems for new services and future challenges.

One of the fundamental enablers is support for distributed AI-enabled network functions (NFs), which allow each function to make AI-based decisions independently and to collaborate with other AI-enabled entities in a federated manner.

Introducing AI agents into the core network will be studied in 6G. This will support intent-based interaction between the network and users or AFs. Furthermore, an AI agent could dynamically compose procedures to fulfil requests from users and AFs, which may change the way the 3GPP core network is standardised.

Various AI-related use cases will be explored during the 6G study, for instance advanced communication between AI agents on UEs and network capability exposure to AI agents on AFs. The core network may also provide AI as a service by collaborating with UEs or application servers to perform joint AI operations, such as AI model training and inference.
Although AI-related traffic will be transferred via the 6G system, whether special handling of the AI traffic is necessary will depend on the analysis of AI traffic to be done by other WGs.

A new data framework that facilitates flexible and efficient data collection between entities across different domains will provide fundamental support for the above AI-related features and services in 6G.

Several issues remain to be addressed for 6G AI implementation, including charging models for AI traffic, validation of AI performance, and ensuring stability and reliability of AI-based operations.

Among these, AI traffic charging represents a particularly critical concern for MNOs. Unlike traditional traffic, AI traffic often has unique characteristics, such as high data volumes for model training or low-latency requirements for inference tasks. Furthermore, AI might be used by the MNOs to improve network performance, which requires the UE to transfer data or information to support the MNOs' operation. Developing a fair and efficient charging mechanism for AI traffic will be required. This work will mainly fall into the SA5 domain.

Ensuring the stability and reliability of AI-based operations is one of the most challenging issues in 6G. AI systems must operate consistently and predictably, even when faced with dynamic network conditions or unexpected events. This requires developing mechanisms to monitor and manage AI performance in real-time, ensuring that models remain accurate and effective. Techniques such as reinforcement learning can be helpful. Models should be updated when they fail to meet accuracy and other performance criteria. Additionally, robust security measures should be considered to protect AI operation from potential threats, which will fall into SA3 domain.

Validation of AI performance is essential to ensure that AI-based solutions meet the required standards for accuracy, efficiency, and reliability. From a standards perspective, the performance validation may involve integrating AI into network operations, ensuring that AI-based decisions and actions align with specified policies and procedures. From an implementation perspective, this may involve developing methodologies to test and verify the performance of AI models in real-world scenarios, particularly in complex and dynamic environments, by both vendors and MNOs.

In conclusion, 6G AI represents a transformative step forward, enabling more intelligent, efficient, and flexible communication networks. By addressing key enablers such as distributed AI-enabled network functions, unified data frameworks, and support of advanced AI agent communication, 6G is positioned to overcome the limitations of 5G and support a wide range of advanced AI-driven scenarios. However, challenges such as charging for AI traffic, stability and reliability of AI-based operations, and AI performance validation must be carefully addressed. 3GPP SA2 WG will continue investigating detailed solutions for each of these technical issues.

References

[1]. RP-251870, New WI: Artificial Intelligence (AI)/Machine Learning (ML) for NR air interface enhancements. Prague, Czech Republic, June 9-13, 2025, 3GPP TSG RAN Meeting #108.
[2]. RP-251864, Artificial Intelligence (AI)/Machine Learning (ML) for mobility in NR. Prague, Czech Republic, June 9-13, 2025, 3GPP TSG RAN Meeting #108.
[3]. RP-213602, New WI: Artificial Intelligence (AI)/Machine Learning (ML) for NG-RAN. Dec. 6 - 17, 2021, 3GPP TSG RAN Meeting #94e.
[4]. 3GPP TS 23.288, Architecture enhancements for 5G System (5GS) to support network data analytics services.
[5]. 3GPP TS 23.273, 5G System (5GS) Location Services (LCS).
[6]. SP-250806, Study on Architecture for 6G System. 10 - 13 June, 2025, Prague, Czech Republic, TSG SA Meeting #108.
[7]. 3GPP TR 23.801-01, Study on Architecture for 6G System; Stage 2, V0.3.0 (2025-11).

Distributed Multiple-Input Multiple-Output (D-MIMO) for Ubiquitous Uplink Performance

1. Introduction

Driven by the rapid surge in user-generated traffic, such as live video streaming, Extended Reality (XR)/Virtual Reality (VR), and autonomous Artificial Intelligence (AI) agents, ubiquitous and reliable uplink (UL) performance has become increasingly important in modern wireless networks [1]. UL D-MIMO is gaining attention as an enabler for addressing these requirements. UL D-MIMO leverages coordinated reception from multiple transmission–reception points (TRPs) to enhance uplink performance. As illustrated in Figure 1, in conventional UL, a single user equipment’s (UE’s) signal is only processed by one TRP, causing interference to neighboring TRPs. In contrast, UL D-MIMO allows nearby TRPs to act as helpers, capturing the UE’s uplink data and sharing it with the serving TRP for combining. The joint reception enhances the received signal power and transforms what would conventionally be a strong inter-cell interferer into a collaborating user, thereby significantly improving spatial diversity and interference suppression.

Figure 1. Illustration of UL D-MIMO system

Although UL D-MIMO has not yet been standardized in 3GPP, related concepts such as uplink Coordinated Multi-Point (CoMP), multi-TRP reception, and distributed massive MIMO are being actively studied in both academia and industry. Vendors and operators increasingly regard UL D-MIMO as a key enabler for achieving performance-assured connectivity and supporting uplink-heavy use cases. This perspective drives further investigation into advanced UL diversity reception techniques [1-5]. Recent industry studies show that D-MIMO can significantly improve throughput and coverage in a variety of scenarios, including cell-edge users, dense deployments with strong inter-cell interference, and higher-order UL MIMO configurations.

In our previous blog, we discussed the D-MIMO technology for the downlink (DL) side [6]. In this blog, we explore the UL D-MIMO system and its key features. Specifically, we discuss several signal combining methods, examining their trade-offs and scalability in terms of UL performance gains versus fronthaul bandwidth and computational complexity. Additionally, we present advanced scheduling schemes that incorporate dynamic UE-centric clustering, adaptive resource allocation, and link adaptation to further enhance the performance of UL D-MIMO systems.

2. Technical Challenges

UL D-MIMO systems encounter challenges at both the physical (PHY) and medium access control (MAC) layers. Geographically distributed TRPs lead to significant variations in signal quality, interference, synchronization, and channel reliability. These variations necessitate carefully designed equalization and combining algorithms at the PHY layer to ensure robust performance. Additionally, they introduce more complex requirements for MAC layer design, such as UE scheduling and TRP selection. Below, we identify the specific challenges at both the PHY and MAC layers in more detail.

Physical layer:

Fronthaul data transfer: UL D-MIMO requires transporting I/Q samples from many distributed Radio Units (RUs) and Massive MIMO Units (MMUs) to a centralized Distributed Unit (DU) for joint processing. In the O-RAN 7-2x split [7], the fronthaul carries I/Q samples for each antenna port. For wideband Orthogonal Frequency Division Multiplexing (OFDM) and massive antenna arrays, this results in very high fronthaul bandwidth requirements, scaling linearly with both the system bandwidth and the number of antennas.

Equalization computation complexity: At the DU, UL D-MIMO joint reception integrates signals from potentially dozens of RUs and MMUs, each with many antenna elements. MMSE-based equalization is attractive for interference suppression and spatial diversity exploitation. However, the computational complexity of fully centralized equalization grows cubically with the total number of receive antennas, making large-scale implementation computationally prohibitive.
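The two scaling laws above can be checked with a back-of-the-envelope sketch. The default numerology (30 kHz subcarrier spacing, 14 symbols per slot, 2000 slots per second), the 16-bit sample width and the neglect of compression and control overhead are assumptions chosen for illustration; the function names are hypothetical.

```python
def fronthaul_rate_bps(
    num_prbs: int,
    num_antenna_ports: int,
    iq_bit_width: int = 16,        # assumed bits per I or Q component, uncompressed
    subcarriers_per_prb: int = 12,
    symbols_per_slot: int = 14,    # assumed normal cyclic prefix
    slots_per_second: int = 2000,  # assumed 30 kHz subcarrier spacing
) -> int:
    """Uncompressed frequency-domain I/Q rate: one I and one Q value per
    subcarrier, per OFDM symbol, per antenna port."""
    samples_per_second = (
        num_prbs * subcarriers_per_prb * symbols_per_slot * slots_per_second
    )
    return samples_per_second * 2 * iq_bit_width * num_antenna_ports


def mmse_inversion_cost(num_rx_antennas: int) -> int:
    """Relative cost of an N x N matrix inversion, which grows as N**3."""
    return num_rx_antennas ** 3


# Doubling the antenna ports doubles the fronthaul rate (linear scaling).
rate_4_ports = fronthaul_rate_bps(num_prbs=273, num_antenna_ports=4)
rate_8_ports = fronthaul_rate_bps(num_prbs=273, num_antenna_ports=8)

# Doubling the jointly processed receive antennas multiplies the cost by 8.
cost_ratio = mmse_inversion_cost(128) / mmse_inversion_cost(64)
```

Under these assumptions a single antenna port at 273 PRBs already needs roughly 2.9 Gbit/s, which is why forwarding every port from every TRP quickly becomes impractical.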

Layer 2 and above:

MU-MIMO in TRP group: Conventionally, MU-MIMO is used to increase cell throughput in single-cell operation. Since UL D-MIMO can apply centralized scheduling in a TRP group with an extended spatial domain, inter-cell interference can be managed by MU-MIMO operation across the multiple cells in a TRP group. However, TRP group-based MU-MIMO operation can increase the scheduling complexity of centralized scheduling.

Dynamic UE-Centric Clustering and Resource Allocation: In user-centric UL D-MIMO networks, the serving TRP cluster for each UE changes over time based on channel conditions and mobility. Consequently, scheduling becomes a challenging task, as it must account for all possible combinations of UEs and TRPs, resulting in high combinatorial complexity.

Link adaptation: Under the UL D-MIMO framework, link adaptation must predict a reliable effective Signal-to-Interference-plus-Noise Ratio (SINR) by considering multiple TRPs with varying channel conditions, receiver algorithms, and dynamic TRP selection results. As a result, the link adaptation process transforms from a straightforward mapping problem under single TRP scenarios into a complex dynamic prediction problem involving uncertainty and varying parameters.
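The combinatorial growth mentioned for UE-centric clustering can be quantified with a short counting sketch. It assumes each UE may be served by any non-empty subset of TRPs up to a maximum cluster size and that UEs choose clusters independently; both the model and the function names are illustrative assumptions.

```python
from math import comb


def clusters_per_ue(num_trps: int, max_cluster_size: int) -> int:
    """Number of non-empty TRP subsets of size at most max_cluster_size."""
    return sum(comb(num_trps, k) for k in range(1, max_cluster_size + 1))


def joint_assignments(num_ues: int, num_trps: int, max_cluster_size: int) -> int:
    """Number of ways to pick one cluster per UE, chosen independently."""
    return clusters_per_ue(num_trps, max_cluster_size) ** num_ues


per_ue = clusters_per_ue(num_trps=8, max_cluster_size=3)            # 92
total = joint_assignments(num_ues=4, num_trps=8, max_cluster_size=3)  # 71639296
```

Even this small example, with 8 TRPs and 4 UEs, yields more than 70 million joint cluster choices before resource allocation is considered, so exhaustive search is not practical in a scheduler.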

3. Fronthaul Efficient Joint Combining with Scalable MMSE Equalization

To address the PHY layer challenges discussed above, practical UL D-MIMO systems require new architectural and algorithmic solutions that reduce both fronthaul bandwidth and centralized processing complexity. This section evaluates three combining schemes. Depending on the selected combining method, locally processed data (raw I/Q samples, log-likelihood ratios (LLRs), or equalizer outputs) are forwarded to the DU, which performs joint signal combining and decoding.

IQ combining (IQC): All received signals are forwarded to the DU as raw IQ samples for centralized processing tasks, including channel estimation and equalization. Local processing is not performed at the RU or MMU. While this scheme achieves the full MMSE joint processing gain, it necessitates significant fronthaul data transfer and substantial processing complexity at the DU.

LLR combining (LLRC): Local pre-equalization is performed at the RU and MMU before fronthaul transport. The DU only combines the LLRs from different TRPs. While this scheme reduces fronthaul traffic and the DU’s processing burden, it may sacrifice overall system performance due to the limited information forwarded.

Distributed equalization combining (DEQC): This method employs partial beamforming or local pre-equalization at the RU/MMU before fronthaul transport, similar to the O-RAN 7-2x DMRS-BF-EQ mode [8]. The DU subsequently performs additional joint combining on the reduced-dimension signals. Such approaches effectively reduce both fronthaul load and MMSE complexity while retaining most of the benefits of full MMSE joint processing.
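The idea of forwarding reduced-dimension signals can be illustrated with a deliberately idealized toy case. The sketch below is not the DEQC scheme evaluated in this post; it assumes a single UE with a single layer, perfectly known channels, and white noise of equal variance at every antenna. Under those assumptions each TRP can forward just two numbers (a matched-filter output and its channel energy) and the DU reproduces the centralized MMSE estimate exactly. All function names are hypothetical.

```python
import random


def joint_mmse(h_per_trp, y_per_trp, noise_var):
    """Centralized estimate of one symbol from all raw antenna samples."""
    h = [g for trp in h_per_trp for g in trp]  # stacked channel vector
    y = [s for trp in y_per_trp for s in trp]  # stacked received samples
    num = sum(g.conjugate() * s for g, s in zip(h, y))
    den = sum(abs(g) ** 2 for g in h) + noise_var
    return num / den


def local_summary(h_trp, y_trp):
    """Per-TRP processing: matched-filter output and channel energy."""
    mf = sum(g.conjugate() * s for g, s in zip(h_trp, y_trp))
    energy = sum(abs(g) ** 2 for g in h_trp)
    return mf, energy


def combine_summaries(summaries, noise_var):
    """DU-side combining of the reduced-dimension per-TRP summaries."""
    num = sum(mf for mf, _ in summaries)
    den = sum(energy for _, energy in summaries) + noise_var
    return num / den


def simulate(num_trps=4, antennas_per_trp=8, noise_var=0.1, seed=0):
    rng = random.Random(seed)

    def cgauss(var):
        # Circularly symmetric complex Gaussian sample with variance `var`.
        std = (var / 2) ** 0.5
        return complex(rng.gauss(0, std), rng.gauss(0, std))

    x = complex(1, -1) / 2 ** 0.5  # one unit-energy QPSK symbol
    h = [[cgauss(1.0) for _ in range(antennas_per_trp)] for _ in range(num_trps)]
    y = [[g * x + cgauss(noise_var) for g in trp] for trp in h]
    x_joint = joint_mmse(h, y, noise_var)
    x_dist = combine_summaries([local_summary(a, b) for a, b in zip(h, y)], noise_var)
    return x_joint, x_dist


x_joint, x_dist = simulate()
print("joint:", x_joint, "distributed:", x_dist)
```

In this toy setting each TRP sends two values per symbol instead of eight antenna samples. Multi-layer transmission, interference, imperfect channel estimates, and fronthaul quantization are all outside the sketch, so it says nothing about the throughput figures reported below.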

The performance of UL D-MIMO is evaluated using link-level simulations across three combining methods.

Figure 2. Throughput comparisons among three combining schemes

Figure 3. Fronthaul bandwidth requirement and overall computational complexity in terms of floating point operations per second (FLOPS) for Physical Uplink Shared Channel (PUSCH) operation among three combining schemes

Figures 2 and 3 show that IQC delivers the best performance, albeit with the highest complexity and fronthaul overhead. In contrast, LLRC requires approximately 10% of the fronthaul bandwidth and 64% of the computational complexity of IQC, yet achieves less than 30% of IQC’s throughput. While LLRC significantly reduces overhead, its performance gain is only modest. DEQC, on the other hand, offers a balanced tradeoff, achieving 72% of IQC throughput while consuming just 15% of IQC's fronthaul bandwidth and 64.8% of its PUSCH FLOPS. This fronthaul reduction is particularly advantageous for D-MIMO deployments where fronthaul capacity is the critical bottleneck, such as indoor scenarios with high TRP density. Additionally, the roughly 35% reduction in PUSCH FLOPS alleviates the centralized processing burden at the DU, enabling scalability to larger TRP counts. DEQC thus balances fronthaul efficiency, scalability, and joint combining gain, retaining most of the centralized MMSE performance while substantially reducing implementation complexity. This makes DEQC a practical and scalable solution for D-MIMO deployments where both fronthaul and compute resources are constrained.
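The percentages quoted above can be put side by side with a few lines of arithmetic. The values are copied from the paragraph, normalized to IQC = 1.0. The "throughput per unit of fronthaul" ratio is an illustrative figure of merit introduced here, not a metric used in the evaluation, and the LLRC throughput is only an upper bound because the text states "less than 30%".

```python
# Figures relative to IQC (= 1.0), taken from the text above.
# LLRC throughput is given only as "less than 30%", so 0.30 is an upper bound.
schemes = {
    "IQC": {"throughput": 1.00, "fronthaul": 1.00, "flops": 1.000},
    "LLRC": {"throughput": 0.30, "fronthaul": 0.10, "flops": 0.640},
    "DEQC": {"throughput": 0.72, "fronthaul": 0.15, "flops": 0.648},
}


def savings(name):
    """Savings versus IQC plus an illustrative efficiency ratio."""
    s = schemes[name]
    return {
        "fronthaul_saving": 1.0 - s["fronthaul"],
        "flops_saving": 1.0 - s["flops"],
        "throughput_per_fronthaul": s["throughput"] / s["fronthaul"],
    }


for name in schemes:
    r = savings(name)
    print(
        f"{name}: fronthaul saving {r['fronthaul_saving']:.1%}, "
        f"FLOPS saving {r['flops_saving']:.1%}, "
        f"throughput per unit fronthaul {r['throughput_per_fronthaul']:.2f}"
    )
```

For DEQC this gives an 85% fronthaul saving and a 35.2% FLOPS saving, which is where the "35% reduction" in the text comes from.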

In addition, Figure 4 shows the Reference Signal Received Power (RSRP) coverage benefits in an office environment. The left figure illustrates the measured RSRP of a conventional single-TRP system, which suffers from numerous coverage holes. The right figure, however, demonstrates the RSRP when four TRPs are placed to maximize overall coverage. This clearly highlights the key advantage of UL D-MIMO: uniform RSRP coverage across the operating area.

Figure 4. RSRP heatmap within an office, D-MIMO providing uniform coverage

4. UE-Centric Scheduling

4.1 Centralized Scheduler Design for UL MU-D-MIMO

In UL MU-D-MIMO, the centralized scheduler performs MU-MIMO scheduling for the UEs in a TRP group by treating the TRPs in the group as a single cell with more antennas. This reduces the inter-cell interference introduced by TRPs in the same group and increases both average cell throughput and cell-edge UE throughput. Furthermore, the centralized scheduler selects the TRPs for each UE. For instance, a two-stage approach, TRP selection followed by UE pairing and resource allocation, can be applied to reduce complexity, whereas a joint approach can achieve better performance at higher complexity.
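A two-stage scheduler of the kind described above could be structured roughly as follows. This is a minimal sketch under assumptions that are not from the original post: TRPs are selected per UE from RSRP measurements using a hypothetical 6 dB window and a cap of two TRPs, and UE pairing is a simple greedy pass that limits how many co-scheduled UEs each TRP serves on one resource. The actual selection and pairing criteria are not described in the text.

```python
def select_trps(rsrp_dbm, window_db=6.0, max_trps=2):
    """Stage 1: keep the strongest TRPs within window_db of the best one."""
    best = max(rsrp_dbm.values())
    close = [t for t, p in rsrp_dbm.items() if best - p <= window_db]
    close.sort(key=lambda t: rsrp_dbm[t], reverse=True)
    return close[:max_trps]


def pair_ues(clusters, max_ues_per_trp=2):
    """Stage 2: greedily co-schedule UEs on one resource with a per-TRP cap."""
    load = {}
    scheduled = []
    for ue, trps in clusters.items():  # dict order acts as scheduling priority
        if all(load.get(t, 0) < max_ues_per_trp for t in trps):
            scheduled.append(ue)
            for t in trps:
                load[t] = load.get(t, 0) + 1
    return scheduled


# Hypothetical RSRP measurements in dBm for four UEs and three TRPs.
measurements = {
    "ue1": {"trpA": -80, "trpB": -83, "trpC": -100},
    "ue2": {"trpA": -95, "trpB": -82, "trpC": -85},
    "ue3": {"trpA": -78, "trpB": -90, "trpC": -99},
    "ue4": {"trpA": -81, "trpB": -84, "trpC": -97},
}

clusters = {ue: select_trps(rsrp) for ue, rsrp in measurements.items()}
print("clusters:", clusters)
print("co-scheduled:", pair_ues(clusters))
```

Splitting the decision this way keeps each stage linear in the number of UEs, at the cost of ignoring interactions that a joint search over clusters and pairings could exploit.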

4.2 Link Adaptation for Coherent Uplink Joint Reception

Since the received signal quality varies due to the dynamic changes in selected TRPs for joint reception within a TRP group, a more sophisticated link adaptation mechanism is required for UL D-MIMO to determine the optimal Modulation and Coding Scheme (MCS) level. For instance, the combining SINR for the selected TRPs can be predicted using channel quality estimates derived from UL signals, such as the Demodulation Reference Signal (DMRS) and Sounding Reference Signal (SRS). This enables the centralized scheduler to determine the optimal MCS for the selected TRP combination, ensuring efficient and reliable communication.
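As a rough illustration of SINR prediction for a candidate TRP set, the sketch below assumes ideal maximum-ratio combining with independent noise and interference at each TRP, in which case the post-combining SINR is the sum of the per-TRP SINRs in linear scale. The threshold table, the 0.5 dB backoff, and the per-TRP SINR values are hypothetical placeholders; real MCS mappings depend on the receiver and the target error rate.

```python
import math

# Hypothetical (SINR threshold in dB, MCS index) pairs, sorted by threshold.
MCS_THRESHOLDS_DB = [(-5.0, 0), (0.0, 4), (5.0, 9), (10.0, 14), (15.0, 19), (20.0, 24)]


def combined_sinr_db(per_trp_sinr_db):
    """Post-combining SINR under ideal MRC: linear SINRs add up."""
    linear = sum(10 ** (s / 10) for s in per_trp_sinr_db)
    return 10 * math.log10(linear)


def select_mcs(sinr_db, backoff_db=0.5):
    """Pick the highest MCS whose threshold is met after a safety backoff."""
    usable = sinr_db - backoff_db
    mcs = MCS_THRESHOLDS_DB[0][1]  # fall back to the most robust entry
    for threshold_db, index in MCS_THRESHOLDS_DB:
        if usable >= threshold_db:
            mcs = index
    return mcs


# Hypothetical per-TRP SINR predictions (dB) derived from DMRS/SRS estimates.
predicted = {"trpA": 8.0, "trpB": 7.0}

for cluster in [("trpA",), ("trpA", "trpB")]:
    sinr = combined_sinr_db([predicted[t] for t in cluster])
    print(cluster, f"SINR {sinr:.2f} dB -> MCS {select_mcs(sinr)}")
```

Because the selected TRP set can change from one scheduling interval to the next, the prediction has to be recomputed per candidate set rather than read from a single per-UE value.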

Figure 5. Example of TRP changes for joint reception

5. Conclusion

This blog post explores the motivation and concept of UL D-MIMO systems, trends in UL D-MIMO studies, and the associated challenges. As a practical solution to the challenging physical layer issues, we have developed an innovative joint combining scheme for UL D-MIMO systems and verified its complexity, performance, and fronthaul bandwidth requirement. Future work will focus on spatially varying interference, where RUs and MMUs observe different interference conditions due to local interference, neighboring cells, and asynchronous users.

References

[1] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO Networks: Spectral, Energy, and Hardware Efficiency,” Foundations and Trends in Signal Processing: Vol. 11, No. 3-4, pp. 154–655.
[2] K. S. Bondada, U. Saeed, Y. Liang, D. J. Jakubisin, L. Liu, and R. M. Buehrer, "Distributed Uplink Joint Transmission for 6G Communication," 2025 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit), Poznan, Poland, 2025, pp. 554-559.
[3] F. Kronestedt, T. Chen, A. Kaur, and A. Furuskär, “Enhancing 5G uplink performance to enable differentiated services,” Ericsson Technology Review #9, 2025.
[4] “Switched Uplink in 5G-NR: Benefit & Deployment Consideration”, Qualcomm, 2023. [Online]. Available: https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/5G-Whitepaper-Switched_Uplink_in_5g_Benefits_and_Considerations-Qualcomm.pdf
[5] “The physical layer foundations powering 6G”, Nokia Whitepaper, 2025. [Online]. Available: https://www.nokia.com/asset/214991/
[6] “UE-Centric Distributed MIMO for 5G and Beyond - Benefits, Challenges, and Promising Solutions”, Samsung Research, 2025. [Online]. Available: https://research.samsung.com/blog/UE-Centric-Distributed-MIMO-for-5G-and-Beyond-Benefits-Challenges-and-Promising-Solutions
[7] “Overview of O-RAN Fronthaul Specifications”, NTT DOCOMO Technical Journal Vol. 21, 2019. [Online]. Available: https://www.docomo.ne.jp/english/binary/pdf/corporate/technology/rd/technical_journal/bn/vol21_1/vol21_1_007en.pdf
[8] O-RAN Alliance, “O-RAN WG4 Control, User and Synchronization Plane Specification,” O-RAN.WG4.TS.CUS.0-R005-v20.00, 2026. [Online]. Available: https://specifications.o-ran.org/download?id=1033

When One Sensor Learns Another: Cross-Modal AI for Wearables

Illia Fedorin | Margaryta Nastenko | Oleh Semchuk

Wearable devices are expected to deliver increasingly accurate health monitoring while remaining compact, lightweight, and power efficient. However, combining multiple sensing modalities often introduces additional hardware complexity, battery consumption, and robustness challenges.

Among physiological sensors, photoplethysmography (PPG) remains one of the primary technologies for heart rate monitoring. At the same time, PPG is highly sensitive to motion artifacts, temporary signal degradation, and increased power usage during continuous tracking. Accelerometer (ACC) sensors, in contrast, are significantly more robust and energy efficient, but they do not directly measure cardiovascular activity [1-3].

This raises an important research question: can one physiological sensor learn representations that are usually provided by another modality?

Our recent work, presented at ICASSP 2026, explores this concept through a lightweight cross-modal virtual sensing framework for wearable devices. Instead of treating physiological sensors as isolated data sources, we investigated whether synchronized sensor streams could learn latent relationships and compensate for missing or degraded signals.

Motivation: Towards Virtual Physiological Sensing

Modern wearable devices increasingly operate under strict hardware constraints. Compact form factors such as smart rings, earbuds, or lightweight fitness wearables may not always support a full sensor stack. Even when optical sensors are available, motion-heavy activities can severely reduce signal quality.

Traditional multimodal systems rely on explicit sensor fusion: combining ACC and PPG simultaneously to improve robustness. While effective, such approaches still assume that all sensors remain available and reliable during inference.

In our work, we explored a different direction: virtual sensing.

The core idea is to reconstruct or infer physiological information from an alternative modality when the primary signal becomes unavailable, corrupted, or intentionally disabled for power saving.

To demonstrate this concept, we investigated two complementary tasks:

reconstructing virtual PPG-related representations from accelerometer signals,

generating pseudo-motion embeddings from optical signals for motion-aware denoising.

This creates a bidirectional cross-modal framework where one modality can partially compensate for another depending on device constraints and sensing conditions.

High-Intensity Wearable Data Collection

The experiments were conducted using synchronized wearable recordings collected during structured high-intensity interval training (HIIT) sessions. The dataset included:

132 workout logs,

3-axis accelerometer signals,

4-channel PPG signals,

reference heart rate from an ECG-grade chest device.

Unlike controlled laboratory recordings, these sessions contained rapid transitions between sprint and recovery phases, strong wrist movement, and varying physiological responses across participants.

This created a particularly challenging scenario for robust wearable heart rate estimation.

Cross-Modal Virtual Sensing Framework

The proposed framework uses a shared lightweight temporal encoder trained across different sensing directions. As illustrated in Figure 1, the framework learns shared latent representations across physiological modalities.

Figure 1. High-level cross-modal virtual sensing framework for wearable physiological inference.

The system supports:

ACC → virtual PPG reconstruction,

PPG → pseudo-motion embedding generation,

modality-aware denoising,

single-modality real-time inference.

A key aspect of the approach is that the model learns relationships between synchronized modalities during training, while remaining capable of operating with only a single modality during inference.

This allows the framework to support:

sensor dropout scenarios,

degraded sensing conditions,

reduced hardware configurations,

low-power wearable deployment.
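The single-modality operating modes listed above imply some routing logic at inference time. The sketch below shows one way such routing might look. The quality score, its 0.5 threshold, and the path names are hypothetical and are not taken from the published framework.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SensorWindow:
    acc: Optional[Sequence[float]]  # accelerometer samples, None if unavailable
    ppg: Optional[Sequence[float]]  # PPG samples, None if disabled or dropped
    ppg_quality: float = 0.0        # hypothetical quality score in [0, 1]


def choose_inference_path(window: SensorWindow, min_ppg_quality: float = 0.5) -> str:
    """Pick which single-modality branch should produce the estimate."""
    has_acc = window.acc is not None
    has_ppg = window.ppg is not None
    if has_ppg and window.ppg_quality >= min_ppg_quality:
        return "ppg"               # optical signal is usable as-is
    if has_acc:
        return "acc_virtual_ppg"   # infer PPG-related features from motion
    if has_ppg:
        return "ppg_degraded"      # only a poor optical signal is available
    return "unavailable"


print(choose_inference_path(SensorWindow(acc=[0.1, 0.2], ppg=None)))
```

A real device would likely add hysteresis so that the path does not flip on every window, which is omitted here for brevity.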

Figure 2. Cross-modal spectral reconstruction pipeline with adaptive attention and temporal modeling.

Technical Challenges

One of the main challenges was the severe level of motion corruption present in wearable physiological signals during high-intensity exercise.

PPG signals become unstable under rapid wrist movement, while accelerometer data contains large amounts of non-cardiac motion noise. As a result, extracting heart-rate-related information from ACC alone becomes extremely difficult.

Another important constraint was computational efficiency.

The framework was designed for real-time wearable deployment, meaning that latency, parameter count, and memory footprint had to remain minimal while still preserving meaningful physiological representations.

Balancing:

robustness,

efficiency,

cross-modal generalization,

and real-time inference

became one of the central engineering challenges throughout development.

Results

The experiments demonstrated that cross-modal learning can significantly improve physiological inference under partial sensing conditions (see Table 1).

Table 1. Heart rate estimation performance across different sensing configurations.

The proposed virtual sensing approaches substantially improved ACC-only heart rate estimation and approached the performance of full multimodal fusion systems. Notably, the proposed framework narrowed the gap between single-modality inference and full multimodal fusion despite operating under severe motion conditions.

Key observations included:

strong improvement over raw ACC-only estimation,

near fusion-level performance in several configurations,

stable real-time inference,

robust operation under high-motion conditions,

efficient deployment characteristics suitable for wearable hardware.

The framework also demonstrated that attention-based refinement can transform noisy motion representations into more physiologically meaningful latent structures.

Importantly, the goal was not to literally reconstruct raw optical signals, but rather to learn latent physiological representations that preserve heart-rate-related dynamics across modalities.

Figure 3 demonstrates how cross-modal refinement transforms noisy accelerometer representations into physiologically meaningful structures.

Figure 3. Attention-based refinement suppresses motion artifacts and enhances HR-related spectral structure. Top: raw ACC/PPG signals; bottom: denoised ACC representations using attention and VAE-based refinement.

Beyond Heart Rate Monitoring

Although this work focused on wearable heart rate estimation, the broader concept extends beyond a single sensing task.

Cross-modal physiological learning opens possibilities for:

adaptive sensing systems,

reduced sensor stacks,

fault-tolerant wearable inference,

low-power monitoring,

and more flexible multimodal health devices.

Future wearable systems may increasingly rely on virtual sensing approaches, where available modalities dynamically compensate for unavailable ones instead of depending on fixed sensor configurations.

This direction becomes particularly relevant for next-generation compact devices where battery capacity, physical size, and sensing hardware remain highly constrained.

Conclusion

Our work explored how synchronized wearable sensors can learn latent physiological relationships through cross-modal training.

By enabling one modality to partially infer another, the proposed framework demonstrates a step toward more adaptive, robust, and hardware-efficient wearable AI systems.

Rather than relying solely on explicit sensor fusion, future wearable devices may increasingly use learned physiological priors to maintain reliable monitoring under real-world constraints.

Related Publications

ICASSP 2026: Learning Cross-Modal Physiological Signals on Wearables,
https://ieeexplore.ieee.org/document/11462938/

Information Fusion (extended work): Virtual PPG Reconstruction from Accelerometer Data via Adaptive Denoising and Cross-Modal Fusion,
https://www.sciencedirect.com/science/article/abs/pii/S1566253525008437

References

1. I. Fedorin, V. Pohribnyi, D. Sverdlov, and I. Krasnoshchok, “Lightweight neural network based model for real-time precise HR monitoring during high intensity workout using consumer smartwatches,” IEEE EMBC, 2022.
2. I. Fedorin, A. Smielova, M. Nastenko, and I. Krasnoshchok, “From Sprint to Recovery: LSTM-Powered Heart Rate Recovery Forecasting in HIIT Sessions,” IEEE EMBC, 2024.
3. I. Fedorin, K. Slyusarenko, V. Pohribnyi, J. Yoon, G. Park, and H. Kim, “Heart Rate Trend Forecasting During High-Intensity Interval Training Using Consumer Wearable Devices,” ACM MobiCom, 2021.

Online Cursive Handwriting Generation Using Trace Transformation and Symbol-Independent Point Classification Model

1. Introduction

Handwriting generation has been an area of active research for a long time, driven by applications in digital documents, personalized fonts, and assistive technologies. Both online (stroke-by-stroke) and offline (image-based) handwriting generation have been extensively studied [1], and both have evolved toward more sophisticated architectures with a strong emphasis on disentangling style and content.

Offline methods using GANs, diffusion models, and visual transformers produce high-quality images but suffer from unrealistic spacing, background noise, inconsistent ink, and loss of subtle details. Online handwriting generation evolved from early RNN-based approaches [2] into more complex architectures [3-5]. A notable advancement was the development of models trained to disentangle style representations at both writer and character levels [6-7], but these solutions focus on Chinese writing and do not address cursive writing in Latin and Cyrillic scripts, where ligatures (connecting elements between letters) are essential [8]. Meanwhile, existing ligature generation methods [9-10] rely on manually designed heuristics and geometric assumptions, requiring extensive tuning and failing to adapt to diverse writing styles.

As a result, current approaches cannot fully address the challenge of generating high-quality cursive handwriting that preserves individual writing styles and natural letter connections, while maintaining computational efficiency for real-time editing on mobile devices.

We propose a novel approach for generating cursive handwritten text as digital ink traces, complementing existing single-character generation methods [6-7]. Our method learns structural segmentation directly from data using a lightweight RNN classifier and applies trace transformation to seamlessly connect symbols while preserving handwriting style. This enables unified cross-language support and real-time mobile performance.

The evaluation results demonstrate the effectiveness of our approach in generating cursive handwritten text and replicating nuanced writing styles while enabling real-time responsiveness. Although tested on Latin-based languages, the method is adaptable to other scripts with connected writing.

2. Our approach

We propose a method for handwriting ligature synthesis in two steps: structural segmentation for each symbol and stroke transformation to generate ligatures. The approach pipeline is illustrated in Fig. 1.

Figure 1. The complete approach pipeline.

2.1 Head/Tail Detection

We generalize the structural segmentation step by learning it directly from data using a supervised model rather than relying on fixed rules or templates.

All characters undergo a preprocessing step that includes linear resampling (20 points), spatial normalization, and feature calculation (coordinates and a binary pen state flag). Notably, the model does not receive character labels (text codes) as input and operates solely on spatial coordinates, making it character-independent. A lightweight RNN-based pointwise classifier (about 12k parameters) with two stacked BiGRU layers predicts the class (head / body / tail / isolated) of each point based on position and context, enabling adaptation to varying handwriting styles beyond rule-based approaches.
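The preprocessing step can be sketched as follows for a single stroke. Several details here are assumptions, since the post does not specify them: points are spaced equally along the arc length, normalization scales the stroke into a unit box while keeping the aspect ratio, and the pen flag is 1 while the pen is down and 0 on the final point. All function names are illustrative.

```python
import math


def resample_by_arc_length(points, num_points=20):
    """Linearly resample a polyline to num_points equally spaced along its length."""
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]
    if total == 0.0:
        return [points[0]] * num_points  # degenerate stroke (a dot)
    resampled, seg = [], 0
    for k in range(num_points):
        target = total * k / (num_points - 1)
        while seg < len(points) - 2 and cumulative[seg + 1] < target:
            seg += 1
        span = cumulative[seg + 1] - cumulative[seg]
        t = 0.0 if span == 0.0 else min(1.0, (target - cumulative[seg]) / span)
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


def normalize_to_unit_box(points):
    """Shift to the origin and scale the longer side to 1, keeping aspect ratio."""
    xs, ys = zip(*points)
    min_x, min_y = min(xs), min(ys)
    scale = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    return [((x - min_x) / scale, (y - min_y) / scale) for x, y in points]


def preprocess(stroke, num_points=20):
    """Return (x, y, pen_down) features; pen flag convention is an assumption."""
    pts = normalize_to_unit_box(resample_by_arc_length(stroke, num_points))
    return [(x, y, 0.0 if i == len(pts) - 1 else 1.0) for i, (x, y) in enumerate(pts)]


features = preprocess([(0, 0), (10, 0), (10, 5)])
print(len(features), features[0], features[-1])
```

A fixed-length sequence like this is what lets a small recurrent classifier process every symbol with the same input shape, regardless of how many raw points the digitizer produced.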

The resulting structural labels then guide targeted stroke transformation for coherent handwriting modification.

2.2 Trace Deformation

We propose a trace deformation algorithm based on point-wise optimization that adjusts flexible sections of handwritten characters for smooth ligature generation. The boundary points of the deformable segments remain unchanged, while the optimization balances three objectives:

Connection distance minimizes the gap between the endpoints of the transformed tail and head segments, ensuring seamless connections.

Local smoothness distance ensures smooth local transitions without unnatural bends by penalizing large deviations between consecutive points.

Displacement distance prevents excessive deviation from the original character strokes by penalizing the overall shift of each point from its original position.

The total objective is a weighted sum of these three components, with empirically selected weight coefficients to guarantee continuity, stability, and smoothness. After calculating the partial derivatives, the convex optimization problem reduces to a tridiagonal system of linear equations solvable in O(N) operations, enabling real-time performance on resource-limited devices.
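To make the reduction to a tridiagonal system concrete, here is a minimal sketch for one coordinate of one deformable segment. The exact objective and weights are not given in this post, so the formulation below is an assumption: the unknowns are per-point shifts, the first point is pinned, smoothness penalizes the squared difference between consecutive shifts, displacement penalizes each squared shift, and the connection term pulls the last point toward a target. Setting the derivatives to zero yields a tridiagonal system, solved here with the Thomas algorithm in O(N).

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def deform_coordinate(original, target, w_conn=10.0, w_smooth=1.0, w_disp=0.05):
    """Shift one coordinate of a segment so its last point approaches target.

    original[0] stays fixed. The weights are hypothetical defaults.
    """
    n = len(original) - 1  # number of free points
    if n < 1:
        return list(original)
    lower = [0.0] + [-w_smooth] * (n - 1)
    upper = [-w_smooth] * (n - 1) + [0.0]
    diag = [2 * w_smooth + w_disp] * n
    diag[-1] = w_smooth + w_disp + w_conn  # last point: one neighbor plus pull
    rhs = [0.0] * n
    rhs[-1] = w_conn * (target - original[-1])
    shift = solve_tridiagonal(lower, diag, upper, rhs)
    return [original[0]] + [p + s for p, s in zip(original[1:], shift)]


tail_x = [0.0, 1.0, 2.0, 3.0, 4.0]
print(deform_coordinate(tail_x, target=6.0))
```

The same routine would be applied to the x and y coordinates separately. Raising w_conn closes the gap more tightly, while raising w_disp keeps the stroke closer to its original shape.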

3. Experiments and Results

In our study, the primary objective is to assess ligature synthesis with respect to text readability, visual appeal, and consistency with the user’s original handwriting. We evaluate our approach through quantitative analysis, qualitative assessment, and efficiency measurements on mobile devices. The experiments focus on end-to-end evaluation and are limited to English.

3.1 Quantitative analysis

We assess readability, visual appeal, and style consistency using standard metrics (Table 1). Lower Handwriting Distance (HWD) [3], Fréchet Inception Distance (FID) [4], and Kernel Inception Distance (KID) [5] values for connected text confirm improved similarity to user handwriting.

Table 1. Quantitative similarity metrics of generated handwriting with and without connections relative to user-written samples.

Readability was measured via character recognition rate (CRR) and word recognition rate (WRR) using multiple text recognition systems (Table 2). Generated text achieved higher recognition rates than originals, with only ~1% decrease after adding cursivity, demonstrating its limited impact on readability. Across all tools, readability differences between user-written and generated text remained under 5%, demonstrating style-consistent text generation.

Table 2. Readability comparison for user-written and generated text.
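For readers unfamiliar with these readability metrics, the sketch below shows one common way to compute them: one minus the edit distance between the reference and the recognized text, normalized by the reference length, at character level for CRR and word level for WRR. Definitions vary between toolkits and the exact formula used in the evaluation is not stated here, so treat this as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def recognition_rate(ref_units, hyp_units):
    """1 - normalized edit distance, clipped at zero."""
    if not ref_units:
        return 1.0 if not hyp_units else 0.0
    return max(0.0, 1.0 - edit_distance(ref_units, hyp_units) / len(ref_units))


def crr(reference: str, recognized: str) -> float:
    """Character recognition rate."""
    return recognition_rate(list(reference), list(recognized))


def wrr(reference: str, recognized: str) -> float:
    """Word recognition rate."""
    return recognition_rate(reference.split(), recognized.split())


print(f"CRR = {crr('hello', 'hallo'):.2f}")
print(f"WRR = {wrr('the cat sat', 'the cat sit'):.2f}")
```

A single wrong character lowers CRR only slightly but invalidates the whole word for WRR, which is why the two rates are usually reported together.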

3.2 Qualitative evaluation

A user study with 18 participants evaluated 55 image pairs (original/generated) on readability and visual appeal (Fig. 2). For readability, 87% of generated images were non-inferior, with 36% rated better. For visual appeal, 71% of generated images matched or exceeded the originals. Remaining issues stemmed primarily from symbol synthesis artifacts rather than ligature generation.

Figure 2. Qualitative evaluation survey: choice distribution.

3.3 Efficiency

The performance evaluation was conducted on a Samsung Galaxy S25 (CPU-only, single-thread). We assessed the different stages of the handwriting generation process, beginning with symbol synthesis using two different methods, followed by the head/tail detection and trace deformation steps (Table 3). The results show that ligature generation with the proposed approach runs fast enough for on-device use.

Table 3. Time used for different stages of a single text symbol generation.

4. Conclusions

The results confirm that our approach effectively generates natural cursive handwriting while preserving letter shapes. Generated text remains minimally detectable to both human reviewers and automated tools, with WRR improvements of up to 3.21%. Operating directly on raw points ensures computational efficiency suitable for real-time use on low-end mobile platforms.

The method produces generated text that can imperceptibly replace or extend the user’s handwriting, enabling applications such as on-the-fly error correction, auto-completion, and personalized content generation, with letter-level corrections possible without regenerating entire words.

Future work will focus on enhancing adaptability to diverse handwriting styles and extending support to other scripts, including right-to-left writing systems.

References

1. Diaz, Moises, et al. "A survey of handwriting synthesis from 2019 to 2024: A comprehensive review." Pattern Recognition 162 (2025): 111357.
2. Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013).
3. Aksan, Emre, Fabrizio Pece, and Otmar Hilliges. "Deepwriting: Making digital ink editable via deep generative modeling." Proceedings of the 2018 CHI conference on human factors in computing systems. 2018.
4. Tang, Shusen, and Zhouhui Lian. "Write Like You: Synthesizing Your Cursive Online Chinese Handwriting via Metric‐based Meta Learning." Computer Graphics Forum. Vol. 40. No. 2. 2021.
5. Chang, Jen-Hao Rick, et al. "Style equalization: Unsupervised learning of controllable generative sequence models." International Conference on Machine Learning. PMLR, 2022.
6. Dai, Gang, et al. "Disentangling writer and character styles for handwriting generation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
7. Liu, Yu, et al. "Elegantly written: Disentangling writer and character styles for enhancing online Chinese handwriting." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
8. Korovai, Karina, et al. "Handwriting enhancement: recognition-based and recognition-independent approaches for on-device online handwritten text alignment." IEEE Access 12 (2024): 99334-99348.
9. Wang, Jue, et al. "Combining shape and physical models for online cursive handwriting synthesis." International Journal of Document Analysis and Recognition (IJDAR) 7.4 (2005): 219-227.
10. Lin, Zhouchen, and Liang Wan. "Style-preserving english handwriting synthesis." Pattern Recognition 40.7 (2007): 2097-2109.
