# Hardware Acceleration for Knowledge Graph Processing: Challenges & Recent Developments

MACIEJ BESTA<sup>1</sup>, ROBERT GERSTENBERGER<sup>1</sup>, PATRICK IFF<sup>1</sup>, POURNIMA SONAWANE<sup>2</sup>, JUAN GÓMEZ LUNA<sup>1</sup>, RAGHAVENDRA KANAKAGIRI<sup>3</sup>, RUI MIN<sup>4</sup>, GRZEGORZ KWAŚNIEWSKI<sup>1</sup>, ONUR MUTLU<sup>1</sup>, TORSTEN HOEFLER<sup>1</sup>, RAJA APPUSWAMY<sup>5</sup>, AIDAN O MAHONY<sup>2</sup>

<sup>1</sup>ETH Zurich, Zürich, Switzerland
<sup>2</sup>Dell Technologies, Ovens, Co. Cork, Ireland
<sup>3</sup>Indian Institute of Technology Tirupati
<sup>4</sup>HIRO-MicroDataCenters BV, Heerlen, Netherlands
<sup>5</sup>Eurecom, Biot, France

Corresponding authors: maciej.besta@inf.ethz.ch, Aidan.Omahony@dell.com

**ABSTRACT** Knowledge graphs (KGs) have attracted significant attention in recent years, particularly in the area of the Semantic Web, while also gaining popularity in other application domains such as data mining and search engines. Simultaneously, there has been enormous progress in the development of different types of heterogeneous hardware, impacting the way KGs are processed. The aim of this paper is to provide a systematic literature review of knowledge graph hardware acceleration. For this, we present a classification of the primary areas in knowledge graph technology that harness different hardware units for accelerating certain knowledge graph functionalities. We then extensively describe respective works, focusing on how KG-related schemes harness modern hardware accelerators. Based on our review, we identify various research gaps and future exploratory directions that are anticipated to be of significant value both for academics and industry practitioners.

**INDEX TERMS** Knowledge Graphs, Semantic Web, Hardware Architectures, Systematic Literature Review, Graph Algorithms, Heterogeneous Hardware, FPGA, GPU, ASIC, CPU

# **I. INTRODUCTION**

Knowledge graphs (KGs) are structured representations of information that organize data in a way that is easily accessible and understandable. They are used in a variety of applications, including information retrieval, natural language processing, and artificial intelligence (AI) [1]–[6].

In the current era of data-centric ecosystems, it has become vitally important to organize and represent the enormous volume of knowledge appropriately. Recently, knowledge graphs have risen as a powerful tool for representing complex associations among entities and concepts across various domains, enhancing semantic search. As these knowledge graphs continue to grow in both scale and complexity, conventional computing methods encounter difficulties in effectively processing and analysing them in real-time, and addressing these challenges has prompted the investigation of hardware acceleration as a potential solution.

Hardware acceleration involves harnessing specialized hardware components, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs), that are designed to perform specific tasks more efficiently than a general-purpose Central Processing Unit (CPU). Both GPUs and FPGAs [7], [8] use strategies such as optimized memory use and low-precision arithmetic to accelerate computation, thereby boosting CPU-based servers. While hardware acceleration has demonstrated remarkable success in various computational domains, its impact on knowledge graph processing remains relatively unexplored. Hardware acceleration has the potential to substantially enhance the performance of knowledge graph applications, enabling quicker and more precise data processing and analysis.

In this systematic literature review, we examine the applications and consequences of hardware acceleration for knowledge graphs. We review the existing literature on this topic and identify the main findings and trends in the use of hardware acceleration in knowledge graph applications. We also consider the potential benefits and drawbacks of hardware acceleration, and identify challenges and opportunities for future research and development.

# **II. OVERVIEW OF KNOWLEDGE GRAPHS**

Knowledge Graphs (KGs) accumulate and convey knowledge of the real world. They can effectively organize data to represent complex information, so that it can be efficiently and extensively explored in traditional and advanced applications, offering significant benefits for data exploitation in creating new knowledge. Knowledge graphs have emerged as an approach for the systematic representation of knowledge of real-world entities in a machine-readable format [9].

# A. REPRESENTATIONS

A knowledge graph is usually modeled using one of two data representations [10]: the Labeled Property Graph (LPG), also called property graph, and the Resource Description Framework (RDF). The **LPG model** categorizes vertices and edges with the help of labels and allows attributes for vertices and edges in the form of key-value pairs as properties. **RDF** represents knowledge graphs in the form of triples, where each triple consists of a subject, a predicate, and an object. Formally, edges are represented as triples (h, r, t), where h and t are the head and tail entities, and r is the relation between them.
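To make the two representations concrete, the following minimal Python sketch stores a few facts as (h, r, t) triples and as an LPG-style structure with labels and key-value properties. All entity names, relation names, and property values are hypothetical and chosen only for illustration; real RDF stores and property graph databases use far more elaborate layouts.

```python
from collections import defaultdict

# Hypothetical toy facts stored as (head, relation, tail) triples.
triples = [
    ("Zurich", "located_in", "Switzerland"),
    ("ETH_Zurich", "located_in", "Zurich"),
    ("ETH_Zurich", "instance_of", "University"),
]

# Index the triples by head entity so that outgoing edges can be looked up quickly.
outgoing = defaultdict(list)
for h, r, t in triples:
    outgoing[h].append((r, t))

# LPG-style layout: vertices and edges carry a label plus key-value properties.
lpg_vertices = {
    "ETH_Zurich": {"label": "University", "properties": {"name": "ETH Zurich"}},
    "Zurich": {"label": "City", "properties": {"name": "Zurich"}},
}
lpg_edges = [
    {"src": "ETH_Zurich", "dst": "Zurich", "label": "located_in",
     "properties": {"campus": "main"}},  # illustrative edge property
]


def tails(head, relation):
    """Return all tail entities t such that (head, relation, t) is in the graph."""
    return [t for r, t in outgoing[head] if r == relation]
```

For example, `tails("ETH_Zurich", "located_in")` returns `["Zurich"]`. In the triple form, attributes would themselves be expressed as further triples, whereas the LPG form attaches them directly to a vertex or an edge.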

Knowledge graphs, regardless of the used data model, capture the relationships between entities and are composed of entities (vertices, also referred to as nodes) and relationships (edges), forming a graph structure that allows complex interconnections and associations to be easily visualized and understood.

# **B. EMBEDDINGS**

The knowledge graph input is typically human-readable; however, certain tasks benefit from a transformed (embedded) machine representation. **Knowledge Graph Embedding** (**KGE**) models provide a way to represent entities (vertices) and relations (edges) of a knowledge graph in vector spaces, referred to as embeddings. These models capture the semantics of the graph and are used in various downstream tasks like link prediction, classification, and recommendation.

The goal of training a knowledge graph embedding model is to learn embeddings for entities and relations such that the embeddings of the head and tail entities are close to each other in the embedding space when connected by a relation. This is achieved by optimizing a loss function that captures the likelihood of the observed triples in the graph. One common approach is to train on both positive and negative triples. A positive triple is an observed triple in the graph, and a negative triple is a corrupted version of a positive triple. The negative triples are sampled by replacing the head or tail entity of a positive triple with another entity in the graph.
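The training setup described above can be sketched as follows. The text does not fix a scoring function or a loss, so this example assumes a TransE-style translation distance and a margin ranking loss; the toy embeddings, the helper names (`corrupt`, `distance`, `margin_loss`), and the margin value are illustrative assumptions.

```python
import random


def corrupt(triple, entities, known, rng):
    """Build a negative triple by replacing the head or the tail with a random entity."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in known:  # skip corruptions that are observed facts
            return candidate


def distance(emb_e, emb_r, triple):
    """Assumed TransE-style L1 distance ||h + r - t||_1 (lower means more plausible)."""
    h, r, t = triple
    return sum(abs(a + b - c) for a, b, c in zip(emb_e[h], emb_r[r], emb_e[t]))


def margin_loss(emb_e, emb_r, pos, neg, margin=1.0):
    """Margin ranking loss: the positive should be closer than the negative by `margin`."""
    return max(0.0, margin + distance(emb_e, emb_r, pos) - distance(emb_e, emb_r, neg))


# Hypothetical two-dimensional embeddings for three entities and one relation.
emb_e = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
emb_r = {"r": [1.0, 0.0]}
known = {("a", "r", "b")}

negative = corrupt(("a", "r", "b"), ["a", "b", "c"], known, random.Random(0))
loss = margin_loss(emb_e, emb_r, ("a", "r", "b"), ("a", "r", "c"))
```

With these toy vectors the positive triple has distance 0 and the negative `("a", "r", "c")` has distance 9, so the loss is already zero. During training, gradients of this loss would move the embeddings of positive triples closer together and push negatives apart.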

# C. BENEFITS AND APPLICATIONS

The benefits of knowledge graphs are manifold. They provide a structured and semantically rich representation of knowledge, which can be leveraged for various applications. For instance, KGs can be used to improve search engine results by understanding the context and semantics behind a user's query. They can also be used in recommendation systems to provide more personalized and context-aware recommendations [11].

In addition to these, knowledge graphs have been successfully applied in numerous other domains. For instance, in the pharmaceutical industry, KGs can be used to represent complex relationships between drugs, diseases, and patients, thereby aiding in drug discovery and personalized medicine [12], [13]. In the field of social sciences, knowledge graphs can be used to analyze social networks and understand the dynamics of social interactions [14]–[16].

In education, several knowledge graph-based applications focus on supporting remote teaching and learning. For example, considering the importance of course allocation tasks in universities, a knowledge graph-based approach was proposed to automate this task. One could construct a course knowledge graph in which the entities are courses, lecturers, course books, and authors in order to suggest relevant courses to students [17]–[20].

In healthcare, the growth of the medical sector has led to more treatment options. To help navigate these options, medical recommender systems, especially biomedical knowledge graph-based recommender systems (such as doctor and medicine recommender systems), have been developed. For instance, in recommending medications, one can construct a heterogeneous graph whose nodes are medicines, diseases, and patients to recommend accurate and safe medicine prescriptions for patients with complicated medical issues [21]–[29].

Various works also propose to enhance general generative models with knowledge graphs. The focus of these works is usually to use KGs in order to enhance the answers of large language models (LLMs), for example by grounding knowledge in general models to reduce effects such as hallucinations [30]–[34]. Example schemes include Knowledge Graph Prompting (KGP) [35], Graph Neural Prompting (GNP) [36], Think-on-Graph (ToG) [37], Knowledge Solver (KSL) [38], KnowledGPT [39], and others [40], [41]. Zhu et al. [42] discuss how LLMs can be used for enhancing KG construction and tasks. Wen et al. [43] present MindMap, a framework to perform reasoning on KG data. Pertinent triples from a KG are retrieved and the LLM is prompted to answer a question based on these triples and show the reasoning process by generating a "mind map" in the form of a textual reasoning tree.

Retrieval Augmented Generation (RAG) enhances the abilities of LLMs by retrieving documents into the LLM context to provide more accurate and relevant responses. MRAG [44] focuses on multi-aspect problems, whereas structure-enhanced RAG schemes employ different strategies for structuring text to improve retrieval quality. A common idea is to construct a knowledge graph from text, which enables retrieval amongst entities and relationships [45]–[49]. RAPTOR [50] generates multi-level summaries for clusters of related documents, building a tree of summaries with increasing levels of abstraction to better capture the meaning of the text. Graph RAG [51] creates a knowledge graph and summarizes communities in the graph, which provides data at different levels of abstraction.

|  | GPUs (§IV) | FPGAs (§V) | PIM (§VI): PNM | PIM (§VI): PUM | RDMA (§VII) |
|---|---|---|---|---|---|
| Use-Cases | - Knowledge graph embeddings (DGL-KE, GraphVite, Marius)<br>- Graph neural networks (TC-GNN, PiPAD, TinyKG)<br>- Graph analytics & mining (WikiSearch, SMORE, DSNAPSHOT) | - Defect detection for software<br>- Updates in dynamic graphs (GraSU)<br>- Web-query composition & execution | - Matrix-matrix multiplication in graph neural networks (GNNs) |  | - Large-scale, distributed graph databases (CGE, A1, Wukong, RDMA_Mongo, Nessie, HERD, HydraDB, InnerCache) |
| Advantages | - Massive parallelism<br>- Efficient memory management | - High parallelism<br>- High configurability | - Minimized data movement | memory- and compute-bound | - Low-latency, high-bandwidth communication by bypassing OS & CPU |
| Limitations | - Expensive hardware<br>- Limited memory capacity<br>- Complex to program | - Configuring FPGAs requires HDL expertise which is not commonly available | - Unclear performance if memory exceeded | - Immature technology | - Increased complexity of programming and maintenance |
| Opportunities for Future Research (§VIII) | - SmartNICs<br>- Tensor processing units (TPUs)<br>- AI accelerators | - Quantum computing<br>- Neuromorphic computing | - Cryogenic computing<br>- Chiplet architectures |  | - On-chip interconnects<br>- ASICs |

FIGURE 1: Overview of different hardware acceleration areas used for knowledge graph processing.

#### D. CHALLENGES AND FUTURE DIRECTIONS

Despite their numerous benefits, knowledge graphs also pose several challenges. Their heterogeneity, as mentioned earlier, is one such challenge. It requires the development of sophisticated techniques for KG embedding that can effectively capture and preserve the diverse structures and semantics inherent in the knowledge graphs [52].

Another challenge lies in the dynamic nature of knowledge. As new information becomes available, KGs need to be updated to reflect this new knowledge. This requires efficient methods for knowledge graph updating and evolution [53].

Furthermore, the quality of the knowledge graph is heavily dependent on the quality of the input data. Hence, ensuring the accuracy and reliability of the data used to construct the KG is another significant challenge [54], [55].

The existing methods for generating knowledge graph embeddings still suffer from several severe limitations. Many established methods only consider surface facts (triplets) of knowledge graphs and ignore additional information, such as entity types and relation paths, which could further improve the embedding accuracy. As a result, the performance of most traditional methods is unsatisfactory. Recently, some researchers have started to combine additional information with a knowledge graph to improve the efficiency of embedding models [56].

Finally, more efficient processing of KGs is of great relevance in the face of the ongoing growth of dataset sizes. One strategy for achieving higher performance is to incorporate hardware acceleration techniques.

# **III. OVERVIEW OF HARDWARE ACCELERATION**

Hardware acceleration involves offloading specific computational tasks from the CPU to specialized hardware components within a system, leading to more efficient task processing. There are several kinds of hardware acceleration, including GPU (Graphics Processing Unit) for graphics and parallel tasks, DSP (Digital Signal Processor) for handling signals like audio, FPGA (Field-Programmable Gate Array) which can be customized for different uses after its production, ASIC (Application-Specific Integrated Circuit) designed for specific tasks, and NPU or AI Accelerators aimed at speeding up machine learning tasks. Figure 1 presents an overview of the covered hardware acceleration areas for knowledge graph processing.

To facilitate hardware acceleration, various technologies like CUDA by NVIDIA for general-purpose computing on GPUs, OpenCL for programming diverse systems, DXVA for hardware-accelerated video decoding, and WebGL for web-based graphics are used. These tools enable improved performance and energy efficiency, freeing up CPU resources for other tasks and enhancing the user experience. However, challenges such as compatibility issues, increased software complexity, higher initial costs, and the risk of hardware failure also arise. Despite these challenges, the benefits of hardware acceleration, including better performance and efficiency, make it a key element in advancing technology, especially as we move towards more specialized processing tasks.

#### **IV. GPUS & KNOWLEDGE GRAPHS**

Graphics Processing Units (GPUs) have emerged as powerful units for a multitude of computationally intensive tasks. Initially made for gaming graphics, GPUs are now crucial in many devices including smartphones, computers, and gaming consoles. Moreover, GPUs support massive parallelism, making them ideal for tasks that require heavy computation, such as machine learning, scientific computing, and cryptocurrency mining. They are now a key component in supercomputers and data centers. Nine out of the ten top supercomputers in the TOP500 list are powered by GPUs [57].

#### A. FUNDAMENTAL GPU CONCEPTS

GPUs have several key benefits that make them suitable for accelerating knowledge graph applications. GPUs support massive parallelism and excel at performing many operations simultaneously with thousands of cores, making them ideal for processing large-scale knowledge graphs, executing graph algorithms, and training machine learning models on knowledge graphs. Efficient memory management of GPUs can greatly speed up the processing of knowledge graphs. GPUs can use thread-level parallelism and employ warp scheduling. Specifically, GPU threads are grouped into warps, which are scheduled for execution together. By carefully organizing threads that access adjacent graph nodes or edges into the same warp, so that their memory accesses can be combined (a technique known as "coalescing"), one can maximize memory access efficiency and minimize warp divergence, leading to significant performance improvements. GPU frameworks often allow the asynchronous execution of different operations, enabling the overlap of computation and memory transfer to hide latencies and improve throughput. Additionally, stream prioritization of operations can ensure responsive interactive querying in knowledge graph applications. Hardware-accelerated libraries such as cuGraph [58] (from NVIDIA's RAPIDS suite) provide GPU-accelerated graph analytics algorithms, which can be used in knowledge graph applications. For deep learning on knowledge graphs, libraries such as PyTorch Geometric offer GPU-accelerated graph neural network layers.
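The effect of coalescing can be illustrated with a deliberately simplified model written in plain Python rather than GPU code. It counts how many memory segments one warp of 32 threads touches when each thread reads one 4-byte element. The 128-byte segment size and the one-transaction-per-segment rule are modeling assumptions; real hardware behavior depends on the GPU generation and cache configuration.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128  # assumed size of one memory transaction (model parameter)
ELEM_BYTES = 4       # one 32-bit value (e.g., a vertex ID) per thread


def transactions(indices):
    """Count the distinct memory segments touched when thread i reads array[indices[i]]."""
    assert len(indices) == WARP_SIZE
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


# Neighboring threads read neighboring elements: the accesses fall into one segment.
coalesced = list(range(WARP_SIZE))

# Each thread reads an element 32 positions away from its neighbor,
# so every access falls into a different segment.
strided = [i * 32 for i in range(WARP_SIZE)]

coalesced_cost = transactions(coalesced)
strided_cost = transactions(strided)
```

Under this model the contiguous pattern needs a single transaction while the strided pattern needs 32, which is why graph layouts that place the neighbors processed by one warp next to each other in memory tend to perform better.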

# B. KNOWLEDGE GRAPHS WITH GPUS

GPUs have been increasingly leveraged in knowledge graph applications due to their capability for parallel processing in two main areas. First, knowledge graphs are typically stored and indexed using high-performance graph database engines. Data analytics and machine learning techniques are applied to data stored in graph databases by extracting relevant data from the knowledge graphs using query languages (SPARQL for RDF graphs and Cypher, Gremlin, and others for property graphs). A large body of work [59], [60] has focused on accelerating graph databases and individual knowledge graph queries using GPUs. Second, knowledge graphs are employed to learn inductive information using either supervised or unsupervised machine learning approaches. Researchers have focused on the use of GPUs for accelerating this learning process, which we outline below.

# 1) Knowledge Graph Embeddings

Embedding matrices are large and they typically do not fit into the limited GPU memory. A common approach to address this challenge is to keep the embeddings in the main memory and transfer them to the GPU memory as needed. However, this results in severe latency penalties if GPUs exchange data with the main memory frequently.

In a parallel setting where multiple workers together train a model, the graph and the embeddings need to be partitioned across workers. Depending on the partitioning strategy, workers might need access to embeddings of entities and relations that are not local to them. Workers also need to synchronize their updates to the embedding matrices. Depending on the sampling strategy employed, workers might require further remote access for entities of negative triples. This results in a high degree of communication between workers, which can be a bottleneck in the training process.

Several works have explored parallel training of knowledge graph embedding models on GPUs, tackling the above challenges. DGL-KE [12] is a distributed training framework for KGE models that uses a hybrid CPU-GPU system. It employs a distributed key-value store for both the knowledge graph structure and the embeddings, using shared CPU memory. A GPU worker unit retrieves embeddings from CPU memory, updates them, and then writes the embeddings back to the CPU memory. In DGL-KE, the knowledge graph is partitioned via METIS [61] such that most of the entity and relation embeddings are local, in order to minimize communication between the compute units. GPUs are not efficient in handling random memory access, and hence DGL-KE samples the negative triples in the CPU, and then transfers them to the GPU for training. This sampling is done from the local METIS partition to ensure that there is no increase in remote accesses for negative samples. Other optimizations in DGL-KE include relation partitioning and overlap of gradient updates for relation as well as entity embeddings.

**GraphVite** [62] focuses on multi-GPU training of KGE models. In line with past works, it tackles the challenges of limited GPU memory and bus bandwidth as well as synchronization overhead. GraphVite partitions the embeddings in a way that avoids CPU-GPU or inter-GPU communication during training. Positive triples are partitioned so that they access pairwise disjoint embeddings, and negative triples are sampled from the local partition. Workers perform minibatch updates on the local embeddings and synchronize only at the end of an epoch.
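The idea of assigning workers triples that touch disjoint embedding partitions, combined with local negative sampling, can be sketched as follows. This is not GraphVite's or DGL-KE's actual implementation: the modulo partitioning, the round-based schedule, and all function names are assumptions made for illustration. The sketch also assumes that head-side and tail-side embedding partitions are kept as separate tables; with a single shared entity table, a stricter schedule would be needed.

```python
import random


def partition_of(entity_id, num_parts):
    """Assign an entity to a partition (simple modulo hashing, an assumption)."""
    return entity_id % num_parts


def build_blocks(triples, num_parts):
    """Group triples (h, r, t) into blocks keyed by (head partition, tail partition)."""
    blocks = {}
    for h, r, t in triples:
        key = (partition_of(h, num_parts), partition_of(t, num_parts))
        blocks.setdefault(key, []).append((h, r, t))
    return blocks


def schedule(num_parts):
    """Yield rounds of blocks; in a round, all head partitions differ and all tail partitions differ."""
    for shift in range(num_parts):
        yield [(i, (i + shift) % num_parts) for i in range(num_parts)]


def local_negative(triple, num_parts, num_entities, rng):
    """Corrupt the tail using only entities from the tail's own partition."""
    h, r, t = triple
    part = partition_of(t, num_parts)
    candidates = [e for e in range(num_entities)
                  if partition_of(e, num_parts) == part and e != t]
    return (h, r, rng.choice(candidates))


blocks = build_blocks([(0, 0, 4), (3, 1, 1), (2, 0, 2)], num_parts=3)
rounds = list(schedule(3))
negative = local_negative((0, 0, 4), 3, 9, random.Random(0))
```

In each round, one block is handed to each worker, so no two workers update the same head-side or tail-side partition, and negatives never require embeddings from a remote partition.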

**Marius** [63] is a framework designed for efficient computation of graph embeddings on a single machine by leveraging partition caching and buffer-aware data orderings to minimize disk access and interleave data movement with computation. The pipeline updates node embedding parameters in CPU memory asynchronously, allowing for staleness, while the relation embeddings are updated in GPU memory synchronously. This design choice is based on the observation that updates to node embedding vectors are sparse, whereas updates to relation embedding parameters are dense due to the smaller number of edge types in real-world graphs.

Another way to use GPUs for KGEs is to transform the knowledge graph completion problem into a similarity join problem, which can be efficiently processed by GPUs. This method can leverage the metric properties of some KGE models, such as TransE [64] and RotatE [65], to reduce the number of vector operations and filter out irrelevant candidates. By using GPUs, this method can achieve fast and accurate knowledge graph completion on large-scale datasets [66].
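A brute-force version of this formulation is sketched below, assuming TransE-style embeddings where a plausible tail lies close to the translated point h + r. The cited work uses metric properties to prune candidates; this sketch omits any pruning and simply compares every query against every entity, which is the part that maps naturally onto GPU threads. The toy embeddings, the radius, and the function names are illustrative assumptions.

```python
def translate(emb_e, emb_r, h, r):
    """Compute the query point h + r in embedding space."""
    return [a + b for a, b in zip(emb_e[h], emb_r[r])]


def l2(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def similarity_join(queries, emb_e, emb_r, radius):
    """For each (h, r) query, return the entities whose embedding lies within `radius` of h + r."""
    result = {}
    for h, r in queries:
        target = translate(emb_e, emb_r, h, r)
        result[(h, r)] = sorted(e for e, vec in emb_e.items() if l2(target, vec) <= radius)
    return result


# Hypothetical two-dimensional embeddings.
emb_e = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
emb_r = {"r": [1.0, 0.0]}

candidates = similarity_join([("a", "r")], emb_e, emb_r, radius=0.5)
```

Here the query `("a", "r")` is translated to the point `[1.0, 0.0]`, and only entity `"b"` falls within the radius, so it is the single completion candidate.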

#### 2) Graph Neural Networks

A graph neural network (GNN) [67] is a neural network in which input samples are graphs. GNNs, as opposed to embeddings, support end-to-end learning. Thus, GNNs can be used to solve various KG-related tasks like link prediction, knowledge graph alignment, and node classification [68]. GPUs can accelerate the computation of GNNs by exploiting the parallelism and locality of graph operations and utilizing the specialized hardware features of GPUs. For example, **TC-GNN** [69] proposes the use of tensor cores to accelerate sparse matrix multiplication in GNNs by transforming the sparse graph data into dense tensors. **PiPAD** [70] proposes to use pipelining and parallelism techniques to improve the efficiency and scalability of dynamic GNN training on GPUs.

Traditional KGE methods mainly focus on predicting the legitimacy between two entities and a particular relation type. GNNs have been shown to be effective in capturing the topological features of entities such as shapes of neighborhood sub-graphs which are overlooked by the traditional KGE methods [71]. However, their model complexity is higher in terms of the number of trainable parameters.

**Sheikh et al.** [72] propose three key strategies to scale GNNs to large knowledge graphs. Their system leverages vertex-cut partitioning to create self-sufficient graph sections and employs local negative sampling within partitions, significantly reducing communication overhead. It also utilizes edge mini-batch training, allowing efficient handling of large graph sections on GPUs.

**TinyKG** [73] is a memory-efficient framework for training Knowledge Graph Neural Networks. Traditional training of these networks is memory-intensive due to the need to store all intermediate activation maps for gradient computation, making deployment challenging in memory-constrained environments. TinyKG addresses this by using exact activations during the forward pass and storing a quantized version in the GPU buffers. During the backward pass, these quantized activations are dequantized for gradient computation. TinyKG employs a simple quantization algorithm to compress activations, reducing the training memory footprint with minimal accuracy loss.
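The quantize-then-dequantize step can be illustrated with a minimal sketch. TinyKG's exact quantizer is not described here, so the example assumes plain uniform min-max quantization to a fixed number of bits; the function names and the sample activations are hypothetical.

```python
def quantize(activations, bits=8):
    """Uniformly quantize floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(activations), max(activations)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((a - lo) / scale) for a in activations]
    return codes, lo, scale  # the codes are what would be kept in the buffer


def dequantize(codes, lo, scale):
    """Recover approximate activations for the backward pass."""
    return [lo + c * scale for c in codes]


activations = [-1.0, 0.0, 0.5, 2.0]           # exact values used in the forward pass
codes, lo, scale = quantize(activations, bits=8)
restored = dequantize(codes, lo, scale)       # approximate values used for gradients
max_error = max(abs(a - b) for a, b in zip(activations, restored))
```

Each activation is stored as one small integer instead of a 32-bit float, and the reconstruction error is bounded by half a quantization step (`scale / 2`), which is the trade-off between memory footprint and gradient accuracy.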

#### 3) Symbolic Learning and Rule Mining

Machine learning techniques that learn numerical models are hard to interpret and quantify. Instead, symbolic learning can be used to learn hypotheses in a logical (symbolic) language that "explains" sets of positive and negative edges. Such hypotheses are interpretable and quantifiable (e.g., "all airports are domestic or international"), partially addressing the out-of-vocabulary issue. Symbolic learning techniques such as rule and pattern mining are used to discover interesting patterns over knowledge graph data. GPUs can accelerate the main stages of rule mining algorithms, for example frequent itemset generation, candidate pruning, support counting, and confidence evaluation. There exists a large body of work that focuses on accelerating rule mining specifically, and frequent itemset mining more generally, using GPUs. Most of these works focus on accelerating an underlying algorithm such as Apriori [74], which is widely used for rule mining. We refer the reader to a recent survey [75] of several GPU-accelerated frequent itemset mining solutions proposed in the literature for further reference.
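Support counting and confidence evaluation, two of the stages named above, can be sketched in a few lines. The example treats the facts attached to each entity as one transaction; this encoding and the toy data are assumptions made for illustration, and the sketch does not implement candidate generation or pruning.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate how often the consequent holds when the antecedent holds."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Hypothetical transactions: one set of facts per entity.
transactions = [
    frozenset({"type:Airport", "scope:domestic"}),
    frozenset({"type:Airport", "scope:international"}),
    frozenset({"type:Airport", "scope:domestic"}),
    frozenset({"type:City"}),
]

airport_support = support({"type:Airport"}, transactions)
rule_confidence = confidence({"type:Airport"}, {"scope:domestic"}, transactions)
```

Here the itemset `{"type:Airport"}` has support 0.75, and the rule "airport implies domestic" has confidence 2/3. Each transaction can be checked independently of the others, which is the property that GPU implementations exploit for support counting.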

#### 4) Graph Analytics and Mining

The application of analytical methods to large-scale graphs is known as graph analytics. Such algorithms frequently examine the graph topology, or how nodes and groups of nodes are related. GPUs can accelerate a wide range of graph algorithms such as breadth-first search (BFS), single-source shortest path (SSSP), and community detection [60].
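As an illustration of the kind of algorithm involved, the following sketch implements level-synchronous BFS, in which the whole frontier is expanded in one step per level. The code is sequential Python; on a GPU, the loop over frontier vertices is the part that would typically be distributed over threads or warps. The adjacency-list format and the toy graph are assumptions.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS returning the hop distance of every reachable vertex."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:  # independent work items: one per frontier vertex
            for v in adj.get(u, []):
                if v not in level:  # first visit fixes the BFS level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Hypothetical directed graph given as adjacency lists.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(adj, 0)
```

For this graph, vertices 1 and 2 are at level 1 and vertex 3 is at level 2. A parallel version needs an atomic or otherwise synchronized check in place of `v not in level`, since several frontier vertices may discover the same neighbor at once.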

**WikiSearch** [76] is an efficient parallel keyword search engine designed for large-scale knowledge graphs, with a focus on the Wikidata Knowledge Base, though it is applicable to other knowledge graphs as well. To exploit parallelism, a novel approach for keyword search based on the central graph method is proposed. Unlike traditional methods that approximate the group Steiner tree problem, this approach can naturally operate in parallel and returns compact, information-rich answer graphs. It is optimized for both multi-core CPU and GPU architectures. WikiSearch also introduces a novel pruning strategy based on keyword co-occurrence to refine search results further.

Scalable Multi-hOp REasoning (SMORE) [77] is a framework for both single-hop knowledge graph completion and multi-hop reasoning on large knowledge graphs. Multi-hop reasoning involves predicting answers to queries that span multiple relations or hops in the graph, which requires capturing complex dependencies and performing logical operations over entities and relations. The computational complexity of such tasks increases significantly with the number of hops, leading to higher memory requirements and processing times. For a massive knowledge graph containing hundreds of millions of entities, it is not feasible to materialize training instances, and training data needs to be efficiently sampled on the fly with high throughput to ensure efficient utilization of computation resources. SMORE addresses this with a novel bidirectional rejection sampling approach for efficient online training data generation and an asynchronous system design that overlaps data sampling, embedding computation, and CPU-GPU communication. In addition, graph partitioning is not feasible for multi-hop reasoning, as it requires traversing multiple relations in the graph, which will often span multiple partitions. SMORE is therefore designed to operate in a shared memory environment, bypassing the limitations of graph partitioning in multi-hop reasoning, and it demonstrates near-linear speed-up with the number of GPUs used for training.

In the field of biomedicine, the connections among various biomedical entities, including drugs, diseases, symptoms, proteins, and genes play a crucial role in understanding the underlying mechanisms of diseases and drugs. Biomedical knowledge graphs play an important role in representing these connections and are used in various applications such as drug discovery and repurposing. Distributed Accelerated Semiring All-Pairs Shortest Path (DSNAPSHOT) [78] is a scalable knowledge graph analytics system that can perform all-pairs shortest path (APSP) computation on large biomedical knowledge graphs. It exploits the relation between the semiring GEMM [79] and the APSP computation, and implements a GPU-optimized distributed semiring GEMM kernel, the key operation in the Floyd-Warshall algorithm for APSP computation. Further, DSNAPSHOT proposes optimizations for both inter-node and intra-node communication, and achieves 90% parallel efficiency on the Summit supercomputer.
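The relation between semiring GEMM and APSP can be illustrated with a small CPU-only sketch. It is not DSNAPSHOT's kernel: it uses repeated squaring of the distance matrix in the (min, +) semiring instead of a blocked Floyd-Warshall formulation, and the function names and the toy graph are assumptions made for illustration.

```python
import math

INF = math.inf


def min_plus_product(a, b):
    """Semiring GEMM over (min, +): c[i][j] = min over k of a[i][k] + b[k][j]."""
    inner = len(b)
    cols = len(b[0])
    return [
        [min(row[k] + b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def apsp_by_repeated_squaring(dist):
    """All-pairs shortest paths via repeated (min, +) squaring.

    `dist` must have zeros on the diagonal and INF for missing edges.
    After squaring, the matrix covers paths with twice as many edges.
    """
    n = len(dist)
    result = [row[:] for row in dist]
    covered_hops = 1
    while covered_hops < n - 1:
        result = min_plus_product(result, result)
        covered_hops *= 2
    return result


# Hypothetical 4-vertex weighted directed graph.
graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shortest = apsp_by_repeated_squaring(graph)
```

Each call to `min_plus_product` has the same loop structure as an ordinary matrix multiplication, with addition replaced by `min` and multiplication replaced by `+`, which is why a tuned GEMM-like kernel can serve as the core of an APSP computation.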

# 5) Graph Visualization

Due to high dimensionality, heterogeneity, and sparsity of data, displaying knowledge graphs might be difficult. By offering parallel computing capability, high memory bandwidth, and specialized hardware characteristics for graphics tasks, GPUs can make it possible for large-scale graph data to be rendered and processed more quickly, thereby speeding up knowledge graph visualization. One such example is **KG4Vis** [80], a knowledge graph-based approach for visualization recommendation. It uses a TransE-based embedding technique to learn the embeddings of both entities and relations of the knowledge graph from existing dataset-visualization pairs. Such embeddings intrinsically model the desirable visualization rules and can be accelerated by GPUs.
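As a rough illustration of TransE-style scoring, the sketch below ranks candidate tail entities by how close `head + relation` lies to each tail embedding. The entity names, the relation name, and the two-dimensional embedding values are hypothetical and are not taken from KG4Vis; trained embeddings would be learned and have far more dimensions.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Hypothetical embeddings (assumed values, for illustration only).
dataset_feature = [0.2, 0.5]
suits_relation = [0.3, -0.1]
candidates = {
    "bar_chart": [0.5, 0.4],
    "pie_chart": [-0.4, 0.9],
}

# Higher score means a more plausible (head, relation, tail) triple.
ranked = sorted(
    candidates,
    key=lambda name: transe_score(dataset_feature, suits_relation, candidates[name]),
    reverse=True,
)
```

Scoring every candidate is an independent vector operation, which is the kind of data-parallel work that maps well to GPUs.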

# C. CHALLENGES & LIMITATIONS

As knowledge graphs continue to grow in size and complexity, GPUs will likely play an increasingly important role in managing and extracting value from these datasets. However, GPUs also have several disadvantages that might make them unsuitable for some knowledge graph applications.

The cost of high-performance GPUs can be steep, which can be a barrier to their use, especially for small organizations or individual developers. GPUs have their own onboard memory, which is typically much less than the main memory available to a CPU. While this memory is typically faster, the limited memory capacity can be a challenge when working with large datasets that do not fit into the GPU's memory. GPUs are more complex to program than CPUs. Writing code that effectively leverages the parallel processing capabilities of a GPU can require a different approach than what many developers are accustomed to [81]. GPUs often use more power and generate more heat than CPUs, which can lead to additional hardware requirements regarding power supply and cooling in a computer system. Not all tasks can be effectively parallelized and see benefits from a GPU [82]. Tasks with heavy data dependencies or those that are inherently sequential may not see a performance improvement on a GPU, and might even be slower than on a CPU. Such **limited tasks** may benefit from applying a hybrid (CPU+GPU) processing strategy [83].

Hence, it is worth noting that not all knowledge graph tasks can benefit from GPU acceleration. Certain operations, such as graph updates or graph schema modifications, may not be well-suited for GPU parallelism. The effectiveness of GPU acceleration will ultimately depend on specific graph algorithms, data sizes, and hardware configurations. Thus, concrete benchmarking and microarchitectural analysis of various knowledge graph-related tasks is required to understand the degree to which each task can benefit from GPU acceleration.

# V. FPGAS & KNOWLEDGE GRAPHS

Field-Programmable Gate Arrays (FPGAs) have emerged as integral components in contemporary digital electronics, facilitating the development of custom digital circuits with a degree of versatility unmatched by other devices. Characterized by arrays of programmable logic blocks and configurable interconnects, FPGAs offer a distinctive combination of adaptability and performance.

# A. FUNDAMENTAL FPGA CONCEPTS

The cornerstone of any FPGA, logic elements and logic blocks comprise arrays of both combinational and sequential circuit elements. Programmable in nature, they can be tailored to execute a myriad of logical functions, laying the groundwork for the vast functionalities FPGAs are known for. The pathways of configurable interconnects serve a pivotal role in an FPGA's architecture, facilitating signal routing across the device. Their adaptability ensures seamless communication between discrete segments of a given design, optimizing the device's functionality. A quintessential aspect of an FPGA's reprogrammability is its configuration memory. This component retains the user-defined design logic, effectively determining the FPGA's operational behavior. Acting as the interface between the FPGA and its external environment, I/O blocks are instrumental in the device's ability to both send and receive signals, thereby ensuring effective communication with other devices or components. In light of the stringent timing constraints often associated with FPGA applications, effective clock distribution and management are paramount. Mastery over clock management is crucial for the successful deployment of FPGA-based designs. Analogous to programming languages in the realm of software development, Hardware Description Languages (HDLs) such as VHDL and Verilog, as well as more recent abstractions such as High-Level Synthesis (HLS) [84], are employed to define and describe circuit behavior within an FPGA.

# B. KNOWLEDGE GRAPHS WITH FPGAS

FPGAs have been increasingly utilized for knowledge graph processing due to their high parallelism and configurability, which can significantly enhance the efficiency of graph computations. FPGAs can be used to accelerate various graph processing tasks, including defect detection in software code [85], high-throughput updates on dynamic graphs [86], efficient traversal of edge-labeled directed graphs [87] and automated composition and execution of Semantic Web queries [88]. FPGAs have also been used for implementing various graph algorithms, such as BFS or PageRank [89]–[91].

Different techniques have been employed to optimize the use of FPGAs for knowledge graph processing. For instance, **GraSU** [86], an FPGA library designed for the Xilinx Alveo<sup>TM</sup> U250 accelerator card, exploits the spatial similarity of graph updates to improve overall efficiency. GraSU outperformed two state-of-the-art CPU-based dynamic graph systems significantly in terms of update throughput. Another technique involves the use of a pipeline approach that combines parallel BFS and nondeterministic finite automaton for efficient graph traversal [87]. Additionally, the use of partial runtime-reconfiguration enables transparent query evaluation on an FPGA [88].
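The idea of combining BFS with a nondeterministic finite automaton can be sketched in software as a level-synchronous BFS over (vertex, automaton state) pairs. This is only a sequential illustration of the general technique, not the FPGA pipeline of [87]; the function name, the data layout, and the example graph are assumptions.

```python
def regular_path_targets(edges, nfa, start_state, accept_states, source):
    """Vertices reachable from `source` along a path whose labels the NFA accepts.

    edges: dict mapping a vertex to a list of (label, neighbor) pairs.
    nfa:   dict mapping (state, label) to a set of successor states.
    """
    frontier = {(source, start_state)}
    visited = set(frontier)
    matches = {source} if start_state in accept_states else set()
    while frontier:
        next_frontier = set()
        # Every (vertex, state) pair in the frontier is independent,
        # so a hardware pipeline could expand them in parallel.
        for vertex, state in frontier:
            for label, neighbor in edges.get(vertex, []):
                for next_state in nfa.get((state, label), ()):
                    pair = (neighbor, next_state)
                    if pair not in visited:
                        visited.add(pair)
                        next_frontier.add(pair)
                        if next_state in accept_states:
                            matches.add(neighbor)
        frontier = next_frontier
    return matches


# Hypothetical edge-labeled graph and an automaton for the pattern: knows+ worksAt
edges = {
    "alice": [("knows", "bob"), ("worksAt", "initech")],
    "bob": [("knows", "carol"), ("likes", "dave")],
    "carol": [("worksAt", "acme")],
}
nfa = {(0, "knows"): {1}, (1, "knows"): {1}, (1, "worksAt"): {2}}
employers = regular_path_targets(edges, nfa, 0, {2}, "alice")
```

Here `alice` reaches `acme` through two `knows` edges followed by one `worksAt` edge, while `initech` is excluded because the pattern requires at least one `knows` edge first.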

Several specific schemes and mechanisms have been developed to optimize the use of FPGAs for knowledge graph processing. For instance, a work-stealing-based scheduler, **HWS** [92], has been designed to optimize workload balance on heterogeneous CPU-FPGA systems. Another example is the implementation of a stochastic matrix function estimator on FPGAs to boost the performance and energy efficiency of subgraph centrality computations [93], [94]. Furthermore, an accelerator for quantized Graph Convolutional Networks (GCNs) with edge-level parallelism has been developed, using low-precision integer arithmetic during inference [95], which demonstrated significant speedups and energy savings compared to other models.

# C. ADVANTAGES & DISADVANTAGES

FPGAs offer several advantages for semantic knowledge graph processing. They provide high parallelism and configurability, which can significantly enhance the efficiency of graph computations [85]. FPGAs can also provide significant speedups and energy savings compared to other models [95]. Despite these advantages and the rapid growth of FPGAs, the use of FPGA technology is restricted to a narrow segment of hardware programmers, because FPGA configurations are traditionally designed in a hardware description language rather than in a conventional programming language. The challenge with the HDL approach is that configuring an FPGA requires both coding skills and a detailed knowledge of the underlying hardware, and the required expertise is not widely available. More recent abstractions such as HLS [84], [96]-[100] attempt to alleviate these issues. Additionally, while FPGAs can provide significant performance improvements for certain tasks, they may not benefit all queries [88].

# VI. PROCESSING-IN-MEMORY & KNOWLEDGE GRAPHS

Processing-In-Memory (PIM) is a promising way to alleviate the data movement bottleneck [101], [102], i.e., the waste of execution cycles and energy due to moving data between memory/storage and compute units, in current processor-centric computing systems (e.g., CPU, GPU).

# A. FUNDAMENTAL PIM CONCEPTS

There are two main PIM trends. The first one is called **Processing-Near-Memory (PNM)** and consists of placing compute logic near the memory arrays (e.g., DRAM subarrays, banks, ranks) [103]–[109]. **Processing-Using-Memory (PUM)** is the other one, which leverages the analog operational properties of memory components (e.g., cells, sense amplifiers) to perform computation [110]–[114]. PIM represents a successful research trend in recent years, and several commercial PIM systems and prototypes [115]–[125] have been presented.

# B. KNOWLEDGE GRAPHS WITH PIM

Graph Neural Networks use deep learning to process graph data, including knowledge graphs [126]. GNNs can solve different knowledge graph tasks, such as link prediction, knowledge graph alignment and reasoning, and node classification [68]. There are several classes of GNNs: GCNs [127], attentional GNNs [128] and message-passing GNNs [129]. GCNs are composed of several GCN layers, each computing two steps: the aggregation of vertex features (a reduce operation), and a combination of features (an update operation with typically fully-connected layers). After the loss computation, the backward pass is composed of feature/weight gradients computation (update), and feature gradients aggregation (reduce). While update operations (e.g., matrix multiplication) are compute-bound, reduce operations (e.g. gather-reduce-scatter) are very memory-bound. As such, reduce operations are good candidates for PIM-based acceleration. Several recent works [130]-[134] propose PIM acceleration for GCNs. Some of these works deploy PUM techniques such as ReRAM-based crossbars [131], [133]. Other works use PNM techniques with processing units in DDR DIMMs [130], [134] and in HBM stacks [132].
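The two steps of a GCN layer can be made concrete with a minimal sketch of the forward pass. It assumes mean aggregation over each vertex and its neighbors and a ReLU activation, which is a simplification of the normalization used in actual GCNs; the function name and all values are illustrative.

```python
def gcn_layer(adjacency, features, weights):
    """One simplified GCN layer: aggregation (reduce), then combination (update)."""
    feature_dim = len(features[0])
    output_dim = len(weights[0])

    # Aggregation: gather neighbor features and reduce them (memory-bound,
    # because the accesses follow the irregular graph structure).
    aggregated = []
    for vertex, neighbors in enumerate(adjacency):
        neighborhood = [vertex] + neighbors
        aggregated.append(
            [
                sum(features[u][d] for u in neighborhood) / len(neighborhood)
                for d in range(feature_dim)
            ]
        )

    # Combination: dense matrix multiplication plus ReLU (compute-bound).
    return [
        [
            max(0.0, sum(row[d] * weights[d][o] for d in range(feature_dim)))
            for o in range(output_dim)
        ]
        for row in aggregated
    ]


# Hypothetical 3-vertex graph: vertex 0 is connected to vertices 1 and 2.
adjacency = [[1, 2], [0], [0]]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = [[1.0, -1.0], [0.5, 1.0]]
output = gcn_layer(adjacency, features, weights)
```

The first loop touches feature rows in an order dictated by the graph, while the second is a regular dense computation, which mirrors the split between memory-bound reduce and compute-bound update operations.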

PUM approaches rely on crossbar arrays, which help minimize data movement in reduce operations and compute matrix multiplication efficiently. **ReFlip** [133] proposes a unified crossbar-based PUM architecture that supports both compute-bound and memory-bound kernels. With software/hardware co-optimizations, ReFlip maps both types of kernels efficiently onto the massive parallelism of its ReRAM-based crossbar arrays. **COIN** [131] targets the huge communication overheads of GCNs. For example, processing the Nell [135] knowledge graph causes 2.7TB of data moving between nodes of a baseline crossbar-based PUM architecture. COIN proposes an optimized on-chip interconnection network for efficient communication between compute elements and between the crossbars inside each compute element. The network design is applicable to different crossbars such as ReRAM- and SRAM-based, but COIN prefers ReRAM, which is significantly more energy efficient.

PNM approaches combine heterogeneous computing units that are specialized for different steps. GNNear [134] integrates an ASIC with matrix multiply and vector processing units and PNM-enabled DIMMs. Update operations are computed on the ASIC, while execution units in the buffer chip of the DIMMs compute reduce operations. Huang et al. [130] tackle the large memory footprint and data movement needs of GNNs with memory pooling. The authors propose a customized memory fabric interface for low-latency and high-throughput communication across PNM units in memory extension cards. The PNM units contain a RISC-V core, a matrix multiply unit, and a vector processing unit. SGCN [132] exploits the sparse nature of intermediate GCN features to reduce the memory footprint (via compression) and optimize communication. SGCN places aggregation units (with SIMD MACs) and combination units (with a systolic array for matrix multiplication) near HBM memory.

# C. ADVANTAGES AND DISADVANTAGES

PUM approaches for GNNs offer the advantage that the same crossbar-based PUM unit can accelerate both memory-bound and compute-bound kernels. Their main disadvantage is that they are based on memory technologies that are not yet mature (e.g., limited endurance and high area of ADCs in ReRAM and other non-volatile memories).

PNM approaches tailor their execution units to the specific needs of each step, which represents an advantage of their approach. However, they have yet to show how their performance would scale for knowledge graphs exceeding their memory capacity.

While the aforementioned works show great promise for GCN acceleration, their evaluations are all based on simulation. We hope to soon see efficient implementations of GNNs on existing real-world PIM architectures [115]–[125] and on future ones.

# VII. CLUSTER-LEVEL RDMA & KNOWLEDGE GRAPHS

Remote Direct Memory Access (RDMA) is a mechanism for achieving high performance and scalability in both the supercomputing and the cloud data center landscapes [136]–[151]. RDMA has grown popular as RDMA-enabled network interface cards have become widely used, and it is commonly supported in modern interconnects [152], [153]. Overall, RDMA has many use cases, particularly in distributed environments. Examples include speeding up data replication [154]–[158], transactions [10], [159]–[162], index queries [163], file systems [164], general queries [165]–[167], or analytical workloads [10], [126], [168]–[176].

# A. FUNDAMENTAL RDMA CONCEPTS

In general, the advantages of RDMA stem from the fact that communication bypasses the OS and the CPU, reducing or eliminating overheads such as interrupts. While one can harness RDMA in different ways, highest performance is usually achieved with fully-offloaded one-sided communication. In this approach, processes communicate by directly accessing dedicated portions of other processes' memory. In the established one-sided communication specification included in the Message-Passing Interface [177], this portion is called a window.

One-sided accesses are done with communication operations referred to as puts and gets. They respectively write to and read from windows, offer very low latencies, and most often outperform other communication paradigms such as message passing [137]. Other useful RDMA operations include remote atomics such as Compare-and-Swap or Fetch-and-Add [176]-[179] that are often accelerated by the interconnect hardware. They enable very fast fine-grained synchronization. To enforce data consistency between windows, operations called flushes are employed to explicitly synchronize memories. The communication operations come in two variants, blocking (operation execution blocks until completion) and non-blocking (operation execution returns immediately upon initiating communication). The latter can additionally increase performance by overlapping communication and computation [137], with the user taking responsibility to synchronize memories at some point after starting the call. All of these routines are supported by most RDMA architectures.
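The semantics of these operations can be illustrated with a toy, single-process model of a window. It performs no networking and is not an RDMA or MPI API: the class and method names are hypothetical, and deferring non-blocking puts until the flush is just one behavior that a real system is allowed to exhibit. In MPI, the corresponding routines are `MPI_Put`, `MPI_Get`, `MPI_Fetch_and_op`, `MPI_Compare_and_swap`, and `MPI_Win_flush`.

```python
class ToyWindow:
    """Toy model of a memory window that other processes access one-sidedly."""

    def __init__(self, size):
        self.memory = [0] * size
        self._pending_puts = []  # non-blocking puts that are not yet visible

    def put(self, offset, values):
        """Blocking put: the remote write is complete when the call returns."""
        self.memory[offset:offset + len(values)] = values

    def get(self, offset, count):
        """Blocking get: read `count` elements from the window."""
        return self.memory[offset:offset + count]

    def put_nonblocking(self, offset, values):
        """Non-blocking put: returns at once; guaranteed visible only after flush()."""
        self._pending_puts.append((offset, list(values)))

    def flush(self):
        """Complete all outstanding non-blocking operations on this window."""
        for offset, values in self._pending_puts:
            self.memory[offset:offset + len(values)] = values
        self._pending_puts.clear()

    def fetch_and_add(self, offset, delta):
        """Remote atomic: add `delta` and return the previous value."""
        previous = self.memory[offset]
        self.memory[offset] = previous + delta
        return previous

    def compare_and_swap(self, offset, expected, desired):
        """Remote atomic: store `desired` only if the current value equals `expected`."""
        previous = self.memory[offset]
        if previous == expected:
            self.memory[offset] = desired
        return previous


window = ToyWindow(4)
window.put(0, [7, 8])
window.put_nonblocking(2, [9])
before_flush = window.get(2, 1)   # the write may not be visible yet
window.flush()
after_flush = window.get(2, 1)    # now guaranteed to be visible
ticket = window.fetch_and_add(3, 1)
observed = window.compare_and_swap(3, 1, 42)
```

The gap between `put_nonblocking` and `flush` is where an application would place independent computation to overlap it with communication.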

# B. KNOWLEDGE GRAPHS WITH RDMA

Cray Graph Engine (CGE) [180], [181] is a system developed by Cray to support executing very large-scale RDF triple stores on top of Cray high-performance computing systems. CGE's design is based on the Partitioned Global Address Space (PGAS) abstraction, in which one creates a single logical memory pool encompassing all the physical distributed memories over all compute nodes. Thus, CGE effectively implements the Single Program Multiple Data (SPMD) model and uses a purely one-sided RMA programming model, where memory access is treated as effectively uniform (i.e., at the programming level, one does not distinguish between local or remote accesses). Hence, one does not need to consider problems such as efficient graph partitioning. Simultaneously, to achieve high performance, CGE heavily relies on different hardware features offered by the targeted systems. Such features include latency hiding, passive parallelism, and high network throughput for small remote requests that occur commonly in the targeted graph workloads. For example, one of the architectures used in the evaluation, the XMT2 system, used the Threadstorm processors with 128 hardware streams per node. For the whole system of a typical size (64-512 nodes), this amounts to a total of thousands to tens of thousands of software threads that can be used in a given single graph query. Finally, the CGE implementation relies on a low-level high-performance networking communication library [182] by Cray.

Improvements for graph analytics queries [183] with CGE were made by the use of non-blocking communication for large exchanges and by processing intermediate solutions in stages to exploit locality. Additional recent improvements [184] make CGE more portable by replacing low-level DMAPP operations with one-sided MPI routines, as well as simplifying the software stack. Several optimizations are employed to enable container performance matching that of native execution. Message-aggregation within each compute node is employed for join/scan/merge operations to reduce the number of messages. That number is further reduced by improving the communication patterns to enable storing of messages for the same compute node consecutively. Distinct flushes for put and get operations improve the overlap of computation and communication.
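Message aggregation of this kind can be sketched as grouping outgoing payloads by destination node, so that one combined message is sent per destination instead of one per payload. The function name and the message format below are assumptions for illustration and do not reflect CGE's actual implementation.

```python
from collections import defaultdict


def aggregate_by_destination(messages):
    """Group (destination, payload) pairs so each destination gets one buffer.

    Payloads for the same destination end up stored consecutively, which
    lets them be sent as a single larger message.
    """
    buffers = defaultdict(list)
    for destination, payload in messages:
        buffers[destination].append(payload)
    return dict(buffers)


# Hypothetical outgoing messages: (destination compute node, payload).
outgoing = [(2, "a"), (0, "b"), (2, "c"), (1, "d"), (0, "e")]
aggregated = aggregate_by_destination(outgoing)
messages_sent = len(aggregated)
```

In this example, five individual messages collapse into three aggregated ones, at the cost of buffering payloads before sending.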

A1 [185] is a distributed in-memory graph database developed by Microsoft. It adds graph abstraction and query engine layers on top of an improved FaRM key-value store [160]. FaRM already comes with transaction support and uses one-sided read and write operations as well as RPCs to implement its functionality. A1 further improves FaRM's transactions by employing an optimistic multi-version concurrency control scheme that introduces a global clock and timestamps. RDMA is implemented with the RoCEv2 protocol [186] and DCQCN is used for congestion control. RDMA-based unreliable datagrams are used for clock synchronization and leases. A1 is latency-optimized by employing data structures that reduce the number of read operations, by co-locating data likely to be requested at the same time (such as nodes and their edges), and by aggregating RPCs to reduce the number of messages. A1 uses a semi-structured data model based on Bond [187] with strictly-typed edges and weakly-typed nodes.

Wukong [188] is a research-oriented distributed in-memory RDF triple store. Its storage layer is implemented using a simplified version of an RDMA-friendly distributed hash table based on DrTM-KV [159]. Wukong duplicates edges during graph partitioning to store self-contained subgraphs on each compute node to preserve locality. It supports indices based on type and predicate. These indices are treated as a special kind of node and are usually replicated. Strings are stored separately and mapped to unique IDs to reduce network bandwidth. Wukong supports concurrent query execution with full history pruning, data (in-place) and/or execution (fork/join) migration as well as task stealing for load balancing and to reduce query latency. Originally providing limited update support, Wukong spawned several improved implementations. Wukong+S [189] adds support for stream queries as well as incremental key-value updates. Wukong+G [190] uses GPUs to further improve query throughput by using the GPU memory as cache as well as the massive compute power of GPUs for triple parsing. Adaptive query scheduling was further proposed [191] for Wukong+G to combine the processing of multiple queries similar to the fusion of kernels.
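Mapping strings to unique IDs is commonly done with a dictionary encoding, sketched below. The class name and the example triples are hypothetical, and Wukong's actual string server may be organized differently.

```python
class StringDictionary:
    """Bidirectional mapping between RDF terms (strings) and compact integer IDs."""

    def __init__(self):
        self._id_of = {}
        self._string_of = []

    def encode(self, term):
        """Return the ID of `term`, assigning the next free ID on first use."""
        if term not in self._id_of:
            self._id_of[term] = len(self._string_of)
            self._string_of.append(term)
        return self._id_of[term]

    def decode(self, term_id):
        """Return the original string for a previously assigned ID."""
        return self._string_of[term_id]


# Hypothetical RDF triples (subject, predicate, object).
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:knows", "ex:Carol"),
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
]
dictionary = StringDictionary()
encoded = [tuple(dictionary.encode(term) for term in triple) for triple in triples]
decoded_first = tuple(dictionary.decode(term_id) for term_id in encoded[0])
```

Only the fixed-size integer tuples need to travel over the network during query processing; strings are looked up again when results are returned to the user.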

**RDMA\_Mongo** [192], a document-oriented NoSQL database, uses RDMA writes to replace part of its TCP/IP-communication layer. **Nessie** [193], a key-value store, uses cuckoo hashing with RDMA for its key-value operations. Nessie decouples index and data storage to improve locality. HERD [140], another key-value store, uses one-sided RDMA writes and two-sided RDMA send/receives to complete each of its operations with a single network round trip. The key-value store HydraDB [139] uses RDMA to accelerate its read operations and key caches with timestamps to reduce network pressure for highly skewed workloads. Similarly, InnerCache [194] uses one-sided RDMA to accelerate reads from the key-value store acting as an application cache and two-sided semantics for writing data. RDMA-based Memcached has been used for the integration of the Hadoop storage layer (HDFS) with the underlying high performance parallel filesystem Lustre to improve the I/O performance of big data analytics [195]. Additionally, a non-blocking API extension for Memcached to improve communication and computation overlap, as well as an enhanced runtime design for hybrid use with SSDs, were proposed [196].
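Cuckoo hashing, mentioned above for Nessie, can be sketched as a local, single-threaded table with two candidate slots per key. This is a generic textbook-style illustration with integer keys and simple modular hash functions chosen for determinism; it is not Nessie's RDMA protocol, and the class name and parameters are assumptions.

```python
class CuckooHashTable:
    """Two-table cuckoo hash for non-negative integer keys (illustrative only)."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Two simple hash functions, assumed here purely for illustration.
        if which == 0:
            return key % self.capacity
        return (key // self.capacity) % self.capacity

    def get(self, key):
        """A lookup probes at most two slots, one per table."""
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            slot = self._slot(which, key)
            entry = self.tables[which][slot]
            if entry is not None and entry[0] == key:
                self.tables[which][slot] = (key, value)
                return
        # Otherwise insert, evicting occupants to their alternate table.
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            slot = self._slot(which, entry[0])
            entry, self.tables[which][slot] = self.tables[which][slot], entry
            if entry is None:
                return
            which = 1 - which
        # A full implementation would grow the tables and rehash here.
        raise RuntimeError("eviction chain too long; rehash required")


table = CuckooHashTable()
for key, value in [(1, "a"), (12, "b"), (23, "c")]:  # all collide in table 0
    table.put(key, value)
found = [table.get(1), table.get(12), table.get(23), table.get(34)]
```

Because a lookup inspects a fixed, small number of locations, it can plausibly be served by a bounded number of remote reads, which is one reason this hashing scheme pairs naturally with one-sided access.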

Finally, the **Graph Database Interface (GDI)** [170] has recently been proposed to deliver a toolbox for designing a scalable and high-performance data access and transaction layer for general graph databases that can also be used to maintain knowledge graphs. Its RDMA-based implementation, **GDI-RMA** [171], has been shown to scale to more than a hundred thousand compute cores and to label- and property-rich graphs with more than 500 billion edges. The key mechanisms used for high performance are one-sided non-blocking RDMA communication, hardware-accelerated network atomic operations, and collective communication, a form of group communication that has been tuned over decades by the MPI community [177].

# C. ADVANTAGES AND DISADVANTAGES

RDMA usually enables significant performance advantages in terms of both latency and bandwidth. The former is enabled by eliminating expensive parts of the communication pipeline (such as interrupts) and by supporting features such as network-accelerated atomic operations. The latter is facilitated by features such as the overlap of computation and communication. On the other hand, RDMA is usually more complex to program and maintain. This, however, has been alleviated with efforts such as the one-sided communication within MPI [137], [177] or by GDI [170], [171].

# **VIII. FUTURE RESEARCH OPPORTUNITIES**

We identified four research gaps in current solutions that provide opportunities for future studies: scalability, energy efficiency, real-time processing, and integration with other technologies.

While current hardware solutions have shown promise in handling large-scale knowledge graphs, there is a pressing need to address the **scalability** challenges posed by the ever-growing size and complexity of semantic data. The energy consumption of hardware accelerators, especially when processing extensive knowledge graphs, remains a concern. Research into more **energy-efficient** hardware designs is crucial. The ability to process and update knowledge graphs in **real-time**, especially in dynamic environments, is still an area with limited research.

The synergy between different forms of hardware acceleration and **other technologies** is not fully explored. Existing works, such as DaCe [197], [198], focus on effective and efficient execution of different workloads on the underlying diverse hardware, targeting – among others – machine learning [199], linear algebra kernels [200], and – more recently – GNNs [201]. Extending this line of works towards knowledge graphs specifically is an interesting research opportunity. Another, related, opportunity is to combine this approach with other emerging technologies, such as quantum computing or neuromorphic computing.

In conclusion, while significant strides have been made in the domain of hardware-accelerated semantic knowledge graph processing, there remains a vast landscape of uncharted territory. By addressing the identified research gaps and capitalizing on the highlighted opportunities, the scientific community can pave the way for more efficient, scalable, and innovative solutions in the future.

# A. NOVEL HARDWARE ACCELERATION SCHEMES

Beyond the hardware acceleration techniques previously mentioned, there are several innovative hardware solutions that can enhance knowledge graph processing.

**SmartNICs** are advanced network interface cards (NICs) that allow the processing of tasks on the NIC [202] instead of the CPU. For load balancing in distributed settings, SmartNICs can distribute incoming queries to different servers to ensure efficient utilization of resources. They can also be used for data preprocessing, where SmartNICs filter out irrelevant data and thereby speed up the data ingestion process of the knowledge graph.

Originally designed for machine learning, **Tensor Processing Units (TPUs)** can also be harnessed for knowledge graph tasks that incorporate machine learning. TPUs can accelerate the training and inference of GNNs to be used for node and graph classification as well as link prediction in knowledge graphs. TPUs can also efficiently compute node and edge embeddings for similarity searches and clustering in knowledge graphs.

Additionally, various **AI accelerators**, such as Google's Edge TPU and Intel's Nervana Neural Network Processor, can process knowledge graphs in order to detect anomalies or inconsistencies to guarantee data integrity. Google's Edge TPU can be used on edge devices to locally update a knowledge graph as new data streams in, ensuring that graph processing remains current.

**Quantum Computing** is an emerging field with the potential to redefine computing. For knowledge graph processing it holds the promise of more efficient computation for various algorithms such as graph isomorphism, i.e. determining if two graphs are structurally identical, or pathfinding (finding the shortest path or optimal connections between nodes).

Inspired by the human brain, **Neuromorphic Computing** can be advantageous for traditionally challenging tasks. Neuromorphic chips can help to identify patterns or trends in knowledge graphs and can aid in tasks like recommendation systems or predictive analytics. Similar to how a human brain learns from experience, neuromorphic computing can adaptively learn from data in the knowledge graphs and refine queries and results over time.

Utilizing superconducting circuits that function at ultralow temperatures, **Cryogenic Computing**, though in its infancy, has potential for large-scale knowledge graph processing tasks. At extremely low temperatures, superconducting circuits can process vast amounts of data simultaneously, which allows for massive parallel processing of large-scale knowledge graph analytics, where multiple queries and computations are performed concurrently. Cryogenic computing can also offer significant energy savings, making the processing more sustainable and cost-effective.

Chiplet architectures [203], where multiple silicon dies are integrated into a single package, have gained significant traction in the chip design industry [204]. Chiplets come with a wide range of benefits, including modularity, reusability, flexibility, specialization, cost-efficiency and reduced time-to-market. Even though we are not aware of any chiplet-based accelerators specifically designed to process knowledge graphs, chiplets have played an important role in knowledge graph processing. A team of researchers was selected as Gordon Bell Prize [205] finalists for running their COAST (communication-optimized all-pairs shortest path) [206] algorithm for knowledge graphs on Frontier, the world's first exascale supercomputer, which uses chiplet-based AMD EPYC CPUs [207], [208]. There is also a variety of proposals for chiplet-based accelerators for general graph processing [209]-[212], which hints at a large potential for leveraging their modularity and cost-efficiency in hardware accelerators tailored to knowledge graph processing.

As the number of compute cores in modern processors and accelerators is steadily increasing, networks-on-chips (NoCs) have emerged as a scalable **On-Chip Interconnect** solution. In a NoC, data packets are sent through a series of links and routers, akin to computer networks [213]. The de facto standard topology for NoCs is a 2D mesh [214], [215]; however, more elaborate topologies such as Slim NoC [216] or sparse Hamming graphs [217] have been proposed. Most accelerators for graph processing rely on mesh topologies [218], [219], while some hardware architects argue that a mesh is not suitable for graph algorithms, as these algorithms often cause data movement between physically distant cores. One approach to tackle this challenge is the use of small-world networks [220] as NoC topologies for graph processing accelerators [221], [222]. We believe that there is significant value in a thorough investigation of traffic caused by knowledge graph processing and a subsequent evaluation of NoC topologies for knowledge graph accelerators.

Application-Specific Integrated Circuits (ASICs) mark a transformative shift in VLSI design, enabling systems to be embedded within single chips rather than assembled from multiple components [223]. This evolution, akin to the earlier microprocessor revolution, not only reshapes the electronics industry's design and manufacturing strategies but also interconnects designers, CAE tool developers, and ASIC vendors in intricate ways. Broadly, the term ASIC covers a spectrum from programmable logic devices (PLD) to gate arrays (GA), standard cells (SC), and full custom (FC) designs, with GA and SC being the most commonly referenced. We find it surprising that there are no works on designing ASICs for knowledge graphs beyond the PIM-related works; this constitutes a promising direction of future development.

# **IX. CONCLUSION**

In this paper, we explore hardware acceleration for knowledge graph applications. We review the existing literature, identify main designs and trends in that area, benefits and drawbacks of hardware acceleration, as well as the challenges and opportunities for future research and development. We consider GPUs, FPGAs, Processing-In-Memory, RDMA, and other forms of acceleration. Our work can help design more efficient KG processing schemes.

# ACKNOWLEDGMENTS

This project received funding from the European Research Council (Project PSAP, No. 101002047), and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 955513 (MAELSTROM). This project was supported by the ETH Future Computing Laboratory (EFCL), financed by a donation from Huawei Technologies. This project received funding from the European Union's HE research and innovation programme under the grant agreement No. 101070141 (Project GLACIATION).

# REFERENCES

- [1] R. Reinanda, E. Meij, and M. de Rijke, "Knowledge Graphs: An Information Retrieval Perspective," Foundations and Trends<sup>®</sup> in Information Retrieval, vol. 14, no. 4, pp. 289–444, Oct. 2020. [Online]. Available: https://www.nowpublishers.com/article/Details/INR-063
- [2] P. R. Venkatesh, K. Chaitanya, R. Kumar, and P. R. Krishna, "Conversational Information Retrieval using Knowledge Graphs," in Proceedings of the CIKM 2022 Workshops - 1st Workshop On Proactive And Agent-Supported Information Retrieval (PASIR), ser. CEUR Workshop Proceedings, E. K. Georgios Drakopoulos, Ed., Atlanta, GA, USA, Oct. 2022, vol. 3318. [Online]. Available: https://ceur-ws.org/Vol-3318/paper8.pdf
- [3] P. Schneider, T. Schopf, J. Vladika, M. Galkin, E. Simperl, and F. Matthes, "A Decade of Knowledge Graphs in Natural Language Processing: A Survey," in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), ser. AACL-IJCNLP '22, Y. He, H. Ji, S. Li, Y. Liu, and C.-H. Chang, Eds. Virtual Event: Association for Computational Linguistics, Nov. 2022, pp. 601–614. [Online]. Available: https://aclanthology.org/2022.aacl-main.46
- [4] N. Melluso, I. Grangel-González, and G. Fantoni, "Enhancing Industry 4.0 Standards Interoperability via Knowledge Graphs with Natural Language Processing," Computers in Industry, vol. 140, p. 103676, Apr. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0166361522000732
- [5] I. Tiddi, F. Lécué, and P. Hitzler, Eds., Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges, ser. Studies on the Semantic Web. IOS Press, May 2020, vol. 47. [Online]. Available: https://www.iospress.com/catalog/books/knowledge-graphs-forexplainable-artificial-intelligence-foundations-applications-and
- [6] S. Schramm, C. Wehner, and U. Schmid, "Comprehensible Artificial Intelligence on Knowledge Graphs: A Survey," Apr. 2024, arXiv:2404.03499. [Online]. Available: https://arxiv.org/abs/2404.03499
- [7] J. Morss, "FPGAs vs. GPUs: A Tale of Two Accelerators," Dell Corporate Blog, Jan. 2019, accessed: April 30, 2024. [Online]. Available: https://www.dell.com/en-us/blog/fpgas-vs-gpus-tale-two-accelerators/
- [8] R. Hormuth, "FPGA's Use Cases in the Data Center," Dell Corporate Blog, Aug. 2017, accessed: April 30, 2024. [Online]. Available: https://www.dell.com/en-us/blog/fpgas-use-cases-in-the-data-center/
- [9] A. Hogan, E. Blomqvist, M. Cochez, C. D'amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, and A. Zimmermann, "Knowledge Graphs," ACM Comput. Surv., vol. 54, no. 4, pp. 71:1–71:37, Jul. 2021. [Online]. Available: https://doi.org/10.1145/3447772
- [10] M. Besta, R. Gerstenberger, E. Peter, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, and T. Hoefler, "Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries," ACM Comput. Surv., vol. 56, no. 2, pp. 31:1–31:40, Sep. 2023. [Online]. Available: https://doi.org/10.1145/3604932
- [11] Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, and Q. He, "A Survey on Knowledge Graph-Based Recommender Systems," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 8, pp. 3549–3568, Aug. 2022. [Online]. Available: https://doi.org/10.1109/TKDE.2020.3028705
- [12] D. Zheng, X. Song, C. Ma, Z. Tan, Z. Ye, J. Dong, H. Xiong, Z. Zhang, and G. Karypis, "DGL-KE: Training Knowledge Graph Embeddings at Scale," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '20. Virtual Event, China: Association for Computing Machinery, Jul. 2020, pp. 739–748. [Online]. Available: https://doi.org/10.1145/3397271.3401172
- [13] J. Sybrandt, M. Shtutman, and I. Safro, "MOLIERE: Automatic Biomedical Hypothesis Generation System," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '17. Halifax, NS, Canada: Association for Computing Machinery, Aug. 2017, pp. 1633–1642. [Online]. Available: https://doi.org/10.1145/3097983.3098057
- [14] A. Sheth, S. Padhee, and A. Gyrard, "Knowledge Graphs and Knowledge Networks: The Story in Brief," IEEE Internet Computing, vol. 23, no. 4, pp. 67–75, Jul. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8874979
- [15] J. Qian, X.-Y. Li, C. Zhang, and L. Chen, "De-Anonymizing Social Networks and Inferring Private Attributes using Knowledge Graphs," in Proceedings of the 35th Annual IEEE International Conference on Computer Communications, ser. IEEE INFOCOM '16. San Francisco, CA, USA: IEEE Press, Apr. 2016, pp. 1–9. [Online]. Available: https://ieeexplore.ieee.org/document/7524578
- [16] Q. He, J. Yang, and B. Shi, "Constructing Knowledge Graph for Social Networks in A Deep and Holistic Way," in Companion Proceedings of the Web Conference 2020, ser. WWW '20. Taipei, Taiwan: Association for Computing Machinery, Apr. 2020, pp. 307–308. [Online]. Available: https://doi.org/10.1145/3366424.3383112
- [17] L. S. Nair and M. Shivani, "Knowledge Graph Based Question Answering System for Remote School Education," in Proceedings of the International Conference on Connected Systems & Intelligence, ser. CSI '22. Trivandrum, India: IEEE Press, Sep. 2022, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/9924128
- [18] X. Liang, "The Construction of English Teaching Resource Base in Colleges and Universities Based on Knowledge Graphs," Applied Mathematics and Nonlinear Sciences, vol. 9, no. 1, pp. 1–19, Dec. 2023. [Online]. Available: https://sciendo.com/article/10.2478/amns-2024-0039
- [19] B. Albreiki, T. Habuza, N. Palakkal, and N. Zaki, "Clustering-Based Knowledge Graphs and Entity-Relation Representation Improves the Detection of at Risk Students," Education and Information Technologies, vol. 29, no. 6, pp. 6791–6820, Aug. 2023. [Online]. Available: https://link.springer.com/article/10.1007/s10639-023-11938-8
- [20] C. Grévisse, R. Manrique, O. Mariño, and S. Rothkugel, "Knowledge Graph-Based Teacher Support for Learning Material Authoring," in Proceedings of the 13th Colombian Conference (CCC '18), ser. Communications in Computer and Information Science (CCIS), J. E. Serrano C. and J. C. Martínez-Santos, Eds. Cartagena, Colombia: Springer International Publishing, Sep. 2018, vol. 885, pp. 177–191. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-98998-3\_14
- [21] F. Teng, W. Yang, L. Chen, L. Huang, and Q. Xu, "Explainable Prediction of Medical Codes with Knowledge Graphs," Frontiers in Bioengineering and Biotechnology, vol. 8, pp. 867:1–867:11, Aug. 2020. [Online]. Available: https://www.frontiersin.org/journals/bioengineering-and-biotechnology/articles/10.3389/fbioe.2020.00867/full
- [22] D. Zhang, Q. Jia, S. Yang, X. Han, C. Xu, X. Liu, and Y. Xie, "Traditional Chinese Medicine Automated Diagnosis Based on Knowledge Graph Reasoning," Computers, Materials & Continua, vol. 71, no. 1, pp. 159–170, Nov. 2022. [Online]. Available: https://www.techscience.com/cmc/v71n1/45357
- [23] M. Rotmensch, Y. Halpern, A. Tlimat, S. Horng, and D. Sontag, "Learning a Health Knowledge Graph from Electronic Medical Records," Scientific Reports, vol. 7, no. 1, pp. 5994:1–5994:11, Jul. 2017. [Online]. Available: https://www.nature.com/articles/s41598-017-05778-z
- [24] R. Gatta, M. Vallati, J. Lenkowicz, E. Rojas, A. Damiani, L. Sacchi, B. De Bari, A. Dagliati, C. Fernandez-Llatas, M. Montesi, A. Marchetti, M. Castellano, and V. Valentini, "Generating and Comparing Knowledge Graphs of Medical Processes Using pMineR," in Proceedings of the 9th Knowledge Capture Conference, ser. K-CAP '17. Austin, TX, USA: Association for Computing Machinery, Dec. 2017, pp. 36:1–36:4. [Online]. Available: https://doi.org/10.1145/3148011.3154464
- [25] B. Cope, M. Kalantzis, C. Zhai, A. Krussel, D. Searsmith, D. Ferguson, R. Tapping, and Y. Berrocal, "Maps of Medical Reason: Applying Knowledge Graphs and Artificial Intelligence in Medical Education and Practice," in Bioinformational Philosophy and Postdigital Knowledge Ecologies, ser. Postdigital Science and Education, M. A. Peters, P. Jandrić, and S. Hayes, Eds. Springer International Publishing, Apr. 2022, pp. 133–159. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-030-95006-4\_8
- [26] L. Li, P. Wang, J. Yan, Y. Wang, S. Li, J. Jiang, Z. Sun, B. Tang, T.-H. Chang, S. Wang, and Y. Liu, "Real-World Data Medical Knowledge Graph: Construction and Applications," Artificial Intelligence in Medicine, vol. 103, p. 101817, Mar. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S0933365719309546
- [27] H. Wang, Q. Zu, M. Lu, R. Chen, Z. Yang, Y. Gao, and J. Ding, "Application of Medical Knowledge Graphs in Cardiology and Cardiovascular Medicine: A Brief Literature Review," Advances in Therapy, vol. 39, no. 9, pp. 4052–4060, Jul. 2022. [Online]. Available: https://link.springer.com/article/10.1007/s12325-022-02254-7
- [28] J. Qu, "A Review on the Application of Knowledge Graph Technology in the Medical Field," Scientific Programming, vol. 2022, no. 1, pp. 3212370:1–3212370:12, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1155/2022/3212370
- [29] P. Chandak, K. Huang, and M. Zitnik, "Building a Knowledge Graph to Enable Precision Medicine," Scientific Data, vol. 10, no. 1, pp. 67:1–67:16, Feb. 2023. [Online]. Available: https://www.nature.com/articles/s41597-023-01960-3
- [30] L. Hu, Z. Liu, Z. Zhao, L. Hou, L. Nie, and J. Li, "A Survey of Knowledge Enhanced Pre-Trained Language Models," IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 4, pp. 1413–1430, Apr. 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10234662
- [31] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, "Unifying Large Language Models and Knowledge Graphs: A Roadmap," IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 7, pp. 3580–3599, Jul. 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10387715
- [32] X. Wei, S. Wang, D. Zhang, P. Bhatia, and A. Arnold, "Knowledge Enhanced Pretrained Language Models: A Compreshensive Survey," Oct. 2021, arXiv:2110.08455. [Online]. Available: https://arxiv.org/abs/2110.08455
- [33] J. Yang, X. Hu, G. Xiao, and Y. Shen, "A Survey of Knowledge Enhanced Pre-trained Models," Oct. 2023, arXiv:2110.00269. [Online]. Available: https://arxiv.org/abs/2110.00269
- [34] L. Yang, H. Chen, Z. Li, X. Ding, and X. Wu, "Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling," Jan. 2024, arXiv:2306.11489. [Online]. Available: https://arxiv.org/abs/2306.11489

- [35] Y. Wang, N. Lipka, R. A. Rossi, A. Siu, R. Zhang, and T. Derr, "Knowledge Graph Prompting for Multi-Document Question Answering," Dec. 2023, arXiv:2308.11730. [Online]. Available: https://arxiv.org/abs/2308.11730
- [36] Y. Tian, H. Song, Z. Wang, H. Wang, Z. Hu, F. Wang, N. V. Chawla, and P. Xu, "Graph Neural Prompting with Large Language Models," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, pp. 19080–19088, Mar. 2024. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/29875
- [37] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. Ni, H.-Y. Shum, and J. Guo, "Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph," in Proceedings of the 12th International Conference on Learning Representations, ser. ICLR '24, Vienna, Austria, May 2024, pp. 1–31. [Online]. Available: https://openreview.net/forum?id=nnVO1PvbTv
- [38] C. Feng, X. Zhang, and Z. Fei, "Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs," Sep. 2023, arXiv:2309.03118. [Online]. Available: https://arxiv.org/abs/2309.03118
- [39] X. Wang, Q. Yang, Y. Qiu, J. Liang, Q. He, Z. Gu, Y. Xiao, and W. Wang, "KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases," Aug. 2023, arXiv:2308.11761. [Online]. Available: https://arxiv.org/abs/2308.11761
- [40] R. Brate, M.-H. Dang, F. Hoppe, Y. He, A. Meroño-Peñuela, and V. Sadashivaiah, "Improving Language Model Predictions via Prompts Enriched with Knowledge Graphs," in Proceedings of the Workshop on Deep Learning for Knowledge Graphs, ser. DL4KG@ISWC '22, Hangzhou, China, Oct. 2022. [Online]. Available: https://alammehwish.github.io/dl4kg2022/papers/paper-3.pdf
- [41] L. Luo, Y.-F. Li, G. Haffari, and S. Pan, "Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning," in Proceedings of the 12th International Conference on Learning Representations, ser. ICLR '24, Vienna, Austria, May 2024, pp. 1–24. [Online]. Available: https://openreview.net/forum?id=ZGNWW7xZ6Q
- [42] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, and N. Zhang, "LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities," Aug. 2024, arXiv:2305.13168. [Online]. Available: https://arxiv.org/abs/2305.13168
- [43] Y. Wen, Z. Wang, and J. Sun, "MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models," Mar. 2024, arXiv:2308.09729. [Online]. Available: https://arxiv.org/abs/2308.09729
- [44] M. Besta, A. Kubicek, R. Niggli, R. Gerstenberger, L. Weitzendorf, M. Chi, P. Iff, J. Gajda, P. Nyczyk, J. Müller, H. Niewiadomski, M. Chrapek, M. Podstawski, and T. Hoefler, "Multi-Head RAG: Solving Multi-Aspect Problems with LLMs," Jun. 2024, arXiv:2406.05085. [Online]. Available: https://arxiv.org/abs/2406.05085
- [45] X. Jiang, R. Zhang, Y. Xu, R. Qiu, Y. Fang, Z. Wang, J. Tang, H. Ding, X. Chu, J. Zhao, and Y. Wang, "HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses," Apr. 2024, arXiv:2312.15883. [Online]. Available: https://arxiv.org/abs/2312.15883
- [46] J. Delile, S. Mukherjee, A. V. Pamel, and L. Zhukov, "Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge," Feb. 2024, arXiv:2402.12352. [Online]. Available: https://arxiv.org/abs/2402.12352
- [47] M. M. Hussien, A. N. Melo, A. L. Ballardini, C. S. Maldonado, R. Izquierdo, and M. Ángel Sotelo, "RAG-based Explainable Prediction of Road Users Behaviors for Automated Driving using Knowledge Graphs and Large Language Models," May 2024, arXiv:2405.00449. [Online]. Available: https://arxiv.org/abs/2405.00449
- [48] T. Bui, O. Tran, P. Nguyen, B. Ho, L. Nguyen, T. Bui, and T. Quan, "Cross-Data Knowledge Graph Construction for LLM-enabled Educational Question-Answering System: A Case Study at HCMUT," Apr. 2024, arXiv:2404.09296. [Online]. Available: https://arxiv.org/abs/2404.09296
- [49] Z. Xu, M. J. Cruz, M. Guevara, T. Wang, M. Deshpande, X. Wang, and Z. Li, "Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '24. Washington, DC, USA: Association for Computing Machinery, Jul. 2024, pp. 2905–2909. [Online]. Available: https://doi.org/10.1145/3626772.3661370
- [50] P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning, "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval," Jan. 2024, arXiv:2401.18059. [Online]. Available: https://arxiv.org/abs/2401.18059

- [51] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson, "From Local to Global: A Graph RAG Approach to Query-Focused Summarization," Apr. 2024, arXiv:2404.16130. [Online]. Available: https://arxiv.org/abs/2404.16130
- [52] F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, E. Kharlamov, B. Shao, R. Li, and K. Wang, "OAG: Linking Entities Across Large-Scale Heterogeneous Knowledge Graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 9, pp. 9225–9239, Sep. 2023. [Online]. Available: https://doi.org/10.1109/TKDE.2022.3222168
- [53] T. Wu, A. Khan, M. Yong, G. Qi, and M. Wang, "Efficiently Embedding Dynamic Knowledge Graphs," Knowledge-Based Systems, vol. 250, p. 109124, Aug. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950705122005548
- [54] G. Weikum, "Knowledge Graphs 2021: A Data Odyssey," Proc. VLDB Endow., vol. 14, no. 12, pp. 3233–3238, Jul. 2021. [Online]. Available: https://doi.org/10.14778/3476311.3476393
- [55] H. Chen, G. Cao, J. Chen, and J. Ding, "A Practical Framework for Evaluating the Quality of Knowledge Graph," in Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding, ser. CCKS '19, X. Zhu, B. Qin, X. Zhu, M. Liu, and L. Qian, Eds. Hangzhou, China: Springer Singapore, Aug. 2019, pp. 111–122. [Online]. Available: https://doi.org/10.1007/978-981-15-1956-7\_10
- [56] C. Peng, F. Xia, M. Naseriparsa, and F. Osborne, "Knowledge Graphs: Opportunities and Challenges," Artificial Intelligence Review, vol. 56, no. 11, pp. 13071–13102, Nov. 2023. [Online]. Available: https://doi.org/10.1007/s10462-023-10465-9
- [57] J. Dongarra, M. Meuer, H. Simon, and E. Strohmaier, "Top 500 Supercomputers November 2023," Nov. 2023, accessed: April 30, 2024. [Online]. Available: https://top500.org/lists/top500/2023/11/
- [58] NVIDIA, "cuGraph RAPIDS Graph Analytics Library," 2020, accessed: July 19, 2024. [Online]. Available: https://github.com/rapidsai/cugraph
- [59] S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake, "GPU-Accelerated Database Systems: Survey and Open Challenges," in Transactions on Large-Scale Data- and Knowledge-Centered Systems XV: Selected Papers from ADBIS 2013 Satellite Events, ser. Lecture Notes in Computer Science (LNCS), A. Hameurlain, J. Küng, R. Wagner, B. Catania, G. Guerrini, T. Palpanas, J. Pokorný, and A. Vakali, Eds. Springer Berlin Heidelberg, 2014, vol. 8920, pp. 1–35. [Online]. Available: https://doi.org/10.1007/978-3-662-45761-0\_1
- [60] X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua, "Graph Processing on GPUs: A Survey," ACM Comput. Surv., vol. 50, no. 6, pp. 81:1–81:35, Jan. 2018. [Online]. Available: https://doi.org/10.1145/3128571
- [61] G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998. [Online]. Available: https://doi.org/10.1137/S1064827595287997
- [62] Z. Zhu, S. Xu, J. Tang, and M. Qu, "GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding," in Proceedings of the World Wide Web Conference, ser. WWW '19. San Francisco, CA, USA: Association for Computing Machinery, May 2019, pp. 2494–2504. [Online]. Available: https://doi.org/10.1145/3308558.3313508
- [63] J. Mohoney, R. Waleffe, H. Xu, T. Rekatsinas, and S. Venkataraman, "Marius: Learning Massive Graph Embeddings on a Single Machine," in Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI '21. USENIX Association, Jul. 2021, pp. 533–549. [Online]. Available: https://www.usenix.org/conference/osdi21/presentation/mohoney
- [64] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, "Translating Embeddings for Modeling Multi-Relational Data," in Proceedings of the Conference on Neural Information Processing Systems (NIPS '13), ser. Advances in Neural Information Processing Systems, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds. Lake Tahoe, NV, USA: Curran Associates, Dec. 2013, vol. 26, pp. 2787–2795. [Online]. Available: https://proceedings.neurips.cc/paper\_files/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf
- [65] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, "RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space," in Proceedings of the 7th International Conference on Learning Representations, ser. ICLR '19, New Orleans, LA, USA, May 2019, pp. 1–18. [Online]. Available: https://openreview.net/forum?id=HkgEQnRqYQ
- [66] C.-H. Lee, D.-O. Kang, and H. J. Song, "Fast Knowledge Graph Completion using Graphics Processing Units," Jul. 2023, arXiv:2307.12059. [Online]. Available: https://arxiv.org/abs/2307.12059
- [67] M. Besta, F. Scheidl, L. Gianinazzi, S. Klaiman, J. Müller, and T. Hoefler, "Demystifying Higher-Order Graph Neural Networks," Jun. 2024, arXiv:2406.12841. [Online]. Available: https://arxiv.org/abs/2406.12841
- [68] Z. Ye, Y. J. Kumar, G. O. Sing, F. Song, and J. Wang, "A Comprehensive Survey of Graph Neural Networks for Knowledge Graphs," IEEE Access, vol. 10, pp. 75729–75741, 2022. [Online]. Available: https://doi.org/10.1109/ACCESS.2022.3191784
- [69] Y. Wang, B. Feng, Z. Wang, G. Huang, and Y. Ding, "TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs," in Proceedings of the 2023 USENIX Annual Technical Conference, ser. USENIX ATC '23. Boston, MA, USA: USENIX Association, Jul. 2023, pp. 149–164. [Online]. Available: https://www.usenix.org/conference/atc23/presentation/wang-yuke
- [70] C. Wang, D. Sun, and Y. Bai, "PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs," in Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '23. Montreal, QC, Canada: Association for Computing Machinery, Feb. 2023, pp. 405–418. [Online]. Available: https://doi.org/10.1145/3572848.3577487
- [71] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, "Modeling Relational Data with Graph Convolutional Networks," in The Semantic Web (ESWC '18), ser. Lecture Notes in Computer Science (LNCS), A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam, Eds. Heraklion, Greece: Springer International Publishing, Jun. 2018, vol. 10843, pp. 593–607. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-93417-4\_38
- [72] N. Sheikh, X. Qin, B. Reinwald, and C. Lei, "Scaling Knowledge Graph Embedding Models for Link Prediction," in Proceedings of the 2nd European Workshop on Machine Learning and Systems, ser. EuroMLSys '22. Rennes, France: Association for Computing Machinery, Apr. 2022, pp. 87–94. [Online]. Available: https://doi.org/10.1145/3517207.3526974
- [73] H. Chen, X. Li, K. Zhou, X. Hu, C.-C. M. Yeh, Y. Zheng, and H. Yang, "TinyKG: Memory-Efficient Training Framework for Knowledge Graph Neural Recommender Systems," in Proceedings of the 16th ACM Conference on Recommender Systems, ser. RecSys '22. Seattle, WA, USA: Association for Computing Machinery, Sep. 2022, pp. 257–267. [Online]. Available: https://doi.org/10.1145/3523227.3546760
- [74] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, ser. VLDB '94. Santiago de Chile, Chile: Morgan Kaufmann, Sep. 1994, pp. 487–499. [Online]. Available: https://dl.acm.org/doi/10.5555/645920.672836
- [75] L. Bustio-Martínez, R. Cumplido, M. Letras, R. Hernández-León, C. Feregrino-Uribe, and J. Hernández-Palancar, "FPGA/GPU-Based Acceleration for Frequent Itemsets Mining: A Comprehensive Review," ACM Comput. Surv., vol. 54, no. 9, pp. 179:1–179:35, Oct. 2021. [Online]. Available: https://doi.org/10.1145/3472289
- [76] Y. Yang, D. Agrawal, H. V. Jagadish, A. K. H. Tung, and S. Wu, "An Efficient Parallel Keyword Search Engine on Knowledge Graphs," in Proceedings of the IEEE 35th International Conference on Data Engineering, ser. ICDE '19. Macao: IEEE Press, Apr. 2019, pp. 338–349. [Online]. Available: https://doi.org/10.1109/ICDE.2019.00038
- [77] H. Ren, H. Dai, B. Dai, X. Chen, D. Zhou, J. Leskovec, and D. Schuurmans, "SMORE: Knowledge Graph Completion and Multi-hop Reasoning in Massive Knowledge Graphs," Nov. 2021, arXiv:2110.14890. [Online]. Available: https://arxiv.org/abs/2110.14890
- [78] R. Kannan, P. Sao, H. Lu, D. Herrmannova, V. Thakkar, R. Patton, R. Vuduc, and T. Potok, "Scalable Knowledge Graph Analytics at 136 Petaflop/s," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. Virtual Event: IEEE Press, Nov. 2020, pp. 6:1–6:13. [Online]. Available: https://doi.org/10.1109/SC41405.2020.00010
- [79] P. Sao, R. Kannan, P. Gera, and R. Vuduc, "A Supernodal All-Pairs Shortest Path Algorithm," in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '20. San Diego, CA, USA: Association for Computing Machinery, Feb. 2020, pp. 250–261. [Online]. Available: https://doi.org/10.1145/3332466.3374533
- [80] H. Li, Y. Wang, S. Zhang, Y. Song, and H. Qu, "KG4Vis: A Knowledge Graph-Based Approach for Visualization Recommendation," IEEE Transactions on Visualization & Computer Graphics, vol. 28, no. 01, pp. 195–205, Jan. 2022. [Online]. Available: https://doi.org/10.1109/TVCG.2021.3114863
- [81] V. G. Bulavintsev and D. D. Zhdanov, "Adaptation of Algorithms for Efficient Execution on GPUs," in Optical Design and Testing XI, ser. Proceedings. Nantong, JS, China: SPIE, Oct. 2021, vol. 11895, pp. 159–166. [Online]. Available: https://doi.org/10.1117/12.2601619
- [82] V. G. Bulavintsev, "Flattening of Data-Dependent Nested Loops for Compile-Time Optimization of GPU Programs," International Journal of Open Information Technologies, vol. 7, no. 9, pp. 7–13, 2019. [Online]. Available: http://www.injoit.org/index.php/j1/article/view/761
- [83] V. G. Bulavintsev, O. Zaikin, P. Petrov, and M. Posypkin, "A GPU-Enabled Black-Box Optimization in Application to Dispersion-Based Geoacoustic Inversion," in Proceedings of the VIII International Conference on Optimization and Applications (OPTIMA '17), ser. CEUR Workshop Proceedings, Y. G. Evtushenko, M. Y. Khachay, O. V. Khamisov, Y. A. Kochetov, V. U. Malkova, and M. A. Posypkin, Eds. Petrovac, Montenegro: RWTH Aachen, Oct. 2017, vol. 1987, pp. 95–100. [Online]. Available: https://ceur-ws.org/Vol-1987/paper15.pdf
- [84] J. de Fine Licht, M. Blott, and T. Hoefler, "Designing Scalable FPGA Architectures Using High-Level Synthesis," SIGPLAN Not., vol. 53, no. 1, pp. 403–404, Feb. 2018. [Online]. Available: https://doi.org/10.1145/3200691.3178527
- [85] H. Yan and D. Li, "Knowledge Graph Based on Defect Detection of Aerospace FPGA Software Code," in Proceedings of the 2nd International Conference on Computer Science and Management Technology, ser. ICCSMT '21. Shanghai, China: IEEE Press, Nov. 2021, pp. 261–266. [Online]. Available: https://ieeexplore.ieee.org/document/9786986
- [86] Q. Wang, L. Zheng, Y. Huang, P. Yao, C. Gui, X. Liao, H. Jin, W. Jiang, and F. Mao, "GraSU: A Fast Graph Update Library for FPGA-Based Dynamic Graph Processing," in Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '21. Virtual Event, USA: Association for Computing Machinery, Mar. 2021, pp. 149–159. [Online]. Available: https://doi.org/10.1145/3431920.3439288
- [87] K. Miura, R. Kobayashi, T. Amagasa, H. Kitagawa, N. Fujita, and T. Boku, "An FPGA-based Accelerator for Regular Path Queries over Edge-labeled Graphs," in Proceedings of the 2022 IEEE International Conference on Big Data, ser. Big Data '22. Osaka, Japan: IEEE Press, Dec. 2022, pp. 415–422. [Online]. Available: https://doi.org/10.1109/BigData55660.2022.10020406
- [88] S. Werner, D. Heinrich, J. Piper, S. Groppe, R. Backasch, C. Blochwitz, and T. Pionteck, "Automated Composition and Execution of Hardware-Accelerated Operator Graphs," in Proceedings of the 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip, ser. ReCoSoC '15. Bremen, Germany: IEEE Press, Jun. 2015, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/7238078
- [89] S. Heidari, Y. Simmhan, R. N. Calheiros, and R. Buyya, "Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges," ACM Comput. Surv., vol. 51, no. 3, pp. 60:1–60:53, Jun. 2018. [Online]. Available: https://doi.org/10.1145/3199523
- [90] C.-Y. Gui, L. Zheng, B. He, C. Liu, X.-Y. Chen, X.-F. Liao, and H. Jin, "A Survey on Graph Processing Accelerators: Challenges and Opportunities," Journal of Computer Science and Technology, vol. 34, pp. 339–371, Mar. 2019. [Online]. Available: https://link.springer.com/article/10.1007/s11390-019-1914-z
- [91] M. Besta, D. Stanojevic, J. de Fine Licht, T. Ben-Nun, and T. Hoefler, "Graph Processing on FPGAs: Taxonomy, Survey, Challenges," Apr. 2019, arXiv:1903.06697. [Online]. Available: https://arxiv.org/abs/1903.06697
- [92] M. Agostini, F. O'Brien, and T. Abdelrahman, "Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems," in Proceedings of the 49th International Conference on Parallel Processing, ser. ICPP '20. Edmonton, AB, Canada: Association for Computing Machinery, Aug. 2020, pp. 50:1–50:12. [Online]. Available: https://doi.org/10.1145/3404397.3404433
- [93] H. Giefers, P. Staar, and R. Polig, "Energy-Efficient Stochastic Matrix Function Estimator for Graph Analytics on FPGA," in Proceedings of the 26th International Conference on Field Programmable Logic and Applications, ser. FPL '16. Lausanne, Switzerland: IEEE Press, Aug. 2016, pp. 1–9. [Online]. Available: https://ieeexplore.ieee.org/document/7577350
- [94] H. Giefers, P. Staar, C. Bekas, and C. Hagleitner, "Analyzing the Energy-Efficiency of Sparse Matrix Multiplication on Heterogeneous Systems: A Comparative Study of GPU, Xeon Phi and FPGA," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ser. ISPASS '16. Uppsala, Sweden: IEEE Press, Apr. 2016, pp. 46–56. [Online]. Available: https://ieeexplore.ieee.org/document/7482073
- [95] W. Yuan, T. Tian, Q. Wu, and X. Jin, "QEGCN: An FPGA-Based Accelerator for Quantized GCNs with Edge-Level Parallelism," Journal of Systems Architecture, vol. 129, p. 102596, Aug. 2022. [Online]. Available: https://doi.org/10.1016/j.sysarc.2022.102596
- [96] J. de Fine Licht and T. Hoefler, "hlslib: Software Engineering for Hardware Design," in Proceedings of the Fifth International Workshop on Heterogeneous High-performance Reconfigurable Computing, ser. H2RC '19. Denver, CO, USA: IEEE Press, Nov. 2019.
- [97] J. de Fine Licht, G. Kwasniewski, and T. Hoefler, "Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis," in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '20. Seaside, CA, USA: Association for Computing Machinery, Feb. 2020, pp. 244–254. [Online]. Available: https://doi.org/10.1145/3373087.3375296
- [98] T. De Matteis, J. de Fine Licht, and T. Hoefler, "FBLAS: Streaming Linear Algebra on FPGA," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. Atlanta, GA, USA: IEEE Press, Nov. 2020, pp. 59:1–59:13. [Online]. Available: https://ieeexplore.ieee.org/document/9355265
- [99] C.-J. Johnsen, T. De Matteis, T. Ben-Nun, J. de Fine Licht, and T. Hoefler, "Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping," in Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, ser. ICCAD '22. San Diego, California: Association for Computing Machinery, Nov. 2022, pp. 85:1–85:9. [Online]. Available: https://doi.org/10.1145/3508352.3549374
- [100] J. de Fine Licht, C. A. Pattison, A. N. Ziogas, D. Simmons-Duffin, and T. Hoefler, "Fast Arbitrary Precision Floating Point on FPGA," in Proceedings of the IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '22. New York, NY, USA: IEEE Press, May 2022, pp. 1–9. [Online]. Available: https://ieeexplore.ieee.org/document/9786219
- [101] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, "A Modern Primer on Processing in Memory," in Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, ser. Computer Architecture and Design Methodologies (CADM), M. M. S. Aly and A. Chattopadhyay, Eds. Springer Nature Singapore, Jul. 2022, pp. 171–243. [Online]. Available: https://doi.org/10.1007/978-981-16-7487-7\_7
- [102] S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu, "Processing-in-Memory: A Workload-Driven Perspective," IBM Journal of Research and Development, vol. 63, no. 6, pp. 3:1–3:19, Nov. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8792187
- [103] A. Boroumand, S. Ghose, B. Akin, R. Narayanaswami, G. F. Oliveira, X. Ma, E. Shiu, and O. Mutlu, "Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks," in Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '21. Virtual Event: IEEE Press, Sep. 2021, pp. 159–172. [Online]. Available: https://ieeexplore.ieee.org/document/9563028
- [104] M. Besta, R. Kanakagiri, G. Kwaśniewski, R. Ausavarungnirun, J. Beránek, K. Kanellopoulos, K. Janda, Z. Vonarburg-Shmaria, L. Gianinazzi, I. Stefan, J. G. Luna, J. Golinowski, M. Copik, L. Kapp-Schwoerer, S. Di Girolamo, N. Blach, M. Konieczny, O. Mutlu, and T. Hoefler, "SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems," in Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '21. Virtual Event, Greece: Association for Computing Machinery, Oct. 2021, pp. 282–297. [Online]. Available: https://doi.org/10.1145/3466752.3480133
- [105] R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C.-Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. H. Moreno, J. K. O'Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. D. M. Siegl, K. Sugavanam, and Z. Sura, "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM Journal of Research and Development, vol. 59, no. 2/3, pp. 17:1–17:14, Mar. 2015. [Online]. Available: https://ieeexplore.ieee.org/document/7095154

- [106] S. Cho, H. Choi, E. Park, H. Shin, and S. Yoo, "McDRAM v2: In-Dynamic Random Access Memory Systolic Array Accelerator to Address the Large Model Problem in Deep Neural Networks on the Edge," IEEE Access, vol. 8, pp. 135223–135243, 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9146167
- [107] A. Denzler, G. F. Oliveira, N. Hajinazar, R. Bera, G. Singh, J. Gómez-Luna, and O. Mutlu, "Casper: Accelerating Stencil Computations Using Near-Cache Processing," IEEE Access, vol. 11, pp. 22136–22154, 2023. [Online]. Available: https://ieeexplore.ieee.org/document/10058509
- [108] G. F. Oliveira, J. Gómez-Luna, L. Orosa, S. Ghose, N. Vijaykumar, I. Fernandez, M. Sadrosadati, and O. Mutlu, "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks," IEEE Access, vol. 9, pp. 134457–134502, 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9530719
- [109] C. Giannoula, N. Vijaykumar, N. Papadopoulou, V. Karakostas, I. Fernandez, J. Gómez-Luna, L. Orosa, N. Koziris, G. Goumas, and O. Mutlu, "SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures," in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, ser. HPCA '21. Virtual Event, South Korea: IEEE Press, Mar. 2021, pp. 263–276. [Online]. Available: https://ieeexplore.ieee.org/document/9407213
- [110] L. Orosa, Y. Wang, M. Sadrosadati, J. S. Kim, M. Patel, I. Puddu, H. Luo, K. Razavi, J. Gómez-Luna, H. Hassan, N. Mansouri-Ghiasi, S. Ghose, and O. Mutlu, "CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations," in Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture, ser. ISCA '21. Virtual Event: IEEE Press, Jun. 2021, pp. 484–497. [Online]. Available: https://ieeexplore.ieee.org/document/9499751
- [111] Y. Xi, B. Gao, J. Tang, A. Chen, M.-F. Chang, X. S. Hu, J. Van Der Spiegel, H. Qian, and H. Wu, "In-Memory Learning With Analog Resistive Switching Memory: A Review and Perspective," Proceedings of the IEEE, vol. 109, no. 1, pp. 14–42, Jan. 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9138706
- [112] P. Girard, Y. Cheng, A. Virazel, W. Zhao, R. Bishnoi, and M. B. Tahoori, "A Survey of Test and Reliability Solutions for Magnetic Random Access Memories," Proceedings of the IEEE, vol. 109, no. 2, pp. 149–169, Feb. 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9240959
- [113] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '17. Cambridge, MA, USA: Association for Computing Machinery, Oct. 2017, pp. 273–287. [Online]. Available: https://doi.org/10.1145/3123939.3124544
- [114] N. Hajinazar, G. F. Oliveira, S. Gregorio, J. D. Ferreira, N. M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu, "SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM," in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '21. Association for Computing Machinery, Apr. 2021, pp. 329–345. [Online]. Available: https://doi.org/10.1145/3445814.3446749
- [115] J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu, "Evaluating Machine Learning Workloads on Memory-Centric Computing Systems," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ser. ISPASS '23. Raleigh, NC, USA: IEEE Press, Apr. 2023, pp. 35–49. [Online]. Available: https://ieeexplore.ieee.org/document/10158216
- [116] M. Item, J. Gómez-Luna, Y. Guo, G. F. Oliveira, M. Sadrosadati, and O. Mutlu, "TransPimLib: Efficient Transcendental Functions for Processing-in-Memory Systems," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, ser. ISPASS '23. Raleigh, NC, USA: IEEE Press, Apr. 2023, pp. 235–247. [Online]. Available: https://ieeexplore.ieee.org/document/10158230

- [117] S. Diab, A. Nassereldine, M. Alser, J. Gómez-Luna, O. Mutlu, and I. El Hajj, "A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems," Bioinformatics, vol. 39, no. 5, pp. 1–8, May 2023. [Online]. Available: https://doi.org/10.1093/bioinformatics/btad155
- [118] C. Giannoula, I. Fernandez, J. G. Luna, N. Koziris, G. Goumas, and O. Mutlu, "SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures," Proc. ACM Meas. Anal. Comput. Syst., vol. 6, no. 1, pp. 21:1–21:49, Feb. 2022. [Online]. Available: https://doi.org/10.1145/3508041
- [119] J. Gómez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, "Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System," IEEE Access, vol. 10, pp. 52565–52608, 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9771457
- [120] D. Niu, S. Li, Y. Wang, W. Han, Z. Zhang, Y. Guan, T. Guan, F. Sun, F. Xue, L. Duan, Y. Fang, H. Zheng, X. Jiang, S. Wang, F. Zuo, Y. Wang, B. Yu, Q. Ren, and Y. Xie, "184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System," in Proceedings of the IEEE International Solid-State Circuits Conference, ser. ISSCC '22, vol. 65. San Francisco, CA, USA: IEEE Press, Feb. 2022, pp. 1–3. [Online]. Available: https://ieeexplore.ieee.org/document/9731694
- [121] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho, I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and J. Cho, "A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications," in Proceedings of the IEEE International Solid-State Circuits Conference, ser. ISSCC '22, vol. 65. San Francisco, CA, USA: IEEE Press, Feb. 2022, pp. 1–3. [Online]. Available: https://ieeexplore.ieee.org/document/9731711
- [122] L. Ke, X. Zhang, J. So, J.-G. Lee, S.-H. Kang, S. Lee, S. Han, Y. Cho, J. H. Kim, Y. Kwon, K. Kim, J. Jung, I. Yun, S. J. Park, H. Park, J. Song, J. Cho, K. Sohn, N. S. Kim, and H.-H. S. Lee, "Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM," IEEE Micro, vol. 42, no. 1, pp. 116–127, Jan. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9489313
- [123] D. Lee, J. So, M. Ahn, J.-G. Lee, J. Kim, J. Cho, R. Oliver, V. C. Thummala, R. s. JV, S. S. Upadhya, M. I. Khan, and J. H. Kim, "Improving In-Memory Database Operations with Acceleration DIMM (AxDIMM)," in Proceedings of the 18th International Workshop on Data Management on New Hardware, ser. DaMON '22. Philadelphia, PA, USA: Association for Computing Machinery, Jun. 2022, pp. 2:1–2:9. [Online]. Available: https://doi.org/10.1145/3533737.3535093
- [124] S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product," in Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture, ser. ISCA '21. Virtual Event: IEEE Press, Jun. 2021, pp. 43–56. [Online]. Available: https://ieeexplore.ieee.org/document/9499894
- [125] Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, Y. Cho, J. G. Kim, J. Choi, H.-S. Shin, J. Kim, B. Phuah, H. Kim, M. J. Song, A. Choi, D. Kim, S. Kim, E.-B. Kim, D. Wang, S. Kang, Y. Ro, S. Seo, J. Song, J. Youn, K. Sohn, and N. S. Kim, "25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2 TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications," in Proceedings of the IEEE International Solid-State Circuits Conference, ser. ISSCC '21, vol. 64. Virtual Event: IEEE Press, Feb. 2021, pp. 350–352. [Online]. Available: https://ieeexplore.ieee.org/document/9365862
- [126] M. Besta and T. Hoefler, "Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2584–2606, May 2024. [Online]. Available: https://doi.org/10.1109/TPAMI.2023.3303431
- [127] T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks," in Proceedings of the 5th International Conference on Learning Representations, ser. ICLR '17, Toulon, France, Apr. 2017, pp. 1–14. [Online]. Available: https://openreview.net/forum?id=SJU4ayYgl

- [128] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph Attention Networks," in Proceedings of the Sixth International Conference on Learning Representations, ser. ICLR '18, Vancouver, Canada, May 2018, pp. 1–12. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
- [129] X. Bresson and T. Laurent, "Residual Gated Graph Convnets," Apr. 2018, arXiv:1711.07553. [Online]. Available: https://arxiv.org/abs/1711.07553
- [130] L. Huang, Z. Zhang, S. Li, D. Niu, Y. Guan, H. Zheng, and Y. Xie, "Practical Near-Data-Processing Architecture for Large-Scale Distributed Graph Neural Network," IEEE Access, vol. 10, pp. 46796–46807, 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9761891
- [131] S. K. Mandal, G. Krishnan, A. A. Goksoy, G. R. Nair, Y. Cao, and U. Y. Ogras, "COIN: Communication-aware In-memory Acceleration for Graph Convolutional Networks," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 12, no. 2, pp. 472–485, Jun. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9762264
- [132] M. Yoo, J. Song, J. Lee, N. Kim, Y. Kim, and J. Lee, "SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators," in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, ser. HPCA '23. Montreal, QC, Canada: IEEE Press, Feb. 2023, pp. 1–14. [Online]. Available: https://ieeexplore.ieee.org/document/10071102
- [133] Y. Huang, L. Zheng, P. Yao, Q. Wang, X. Liao, H. Jin, and J. Xue, "Accelerating Graph Convolutional Networks using Crossbar-based Processing-in-memory Architectures," in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, ser. HPCA '22. Virtual Event, South Korea: IEEE Press, Apr. 2022, pp. 1029–1042. [Online]. Available: https://ieeexplore.ieee.org/document/9773267
- [134] Z. Zhou, C. Li, X. Wei, X. Wang, and G. Sun, "GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '22. Association for Computing Machinery, Oct. 2022, pp. 54–68. [Online]. Available: https://doi.org/10.1145/3559009.3569670
- [135] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. Hruschka, and T. Mitchell, "Toward an Architecture for Never-Ending Language Learning," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 24, no. 1, pp. 1306–1313, Jul. 2010. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/7519
- [136] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia, "A Remote Direct Memory Access Protocol Specification," Internet Engineering Task Force, Tech. Rep., Oct. 2007, RFC 5040.
- [137] R. Gerstenberger, M. Besta, and T. Hoefler, "Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13. Denver, CO, USA: Association for Computing Machinery, Nov. 2013, pp. 53:1–53:12. [Online]. Available: https://doi.org/10.1145/2503210.2503286
- [138] C. Mitchell, Y. Geng, and J. Li, "Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store," in Proceedings of the 2013 USENIX Annual Technical Conference, ser. USENIX ATC '13. San Jose, CA, USA: USENIX Association, Jun. 2013, pp. 103–114. [Online]. Available: https://www.usenix.org/system/files/conference/atc13/atc13mitchell.pdf
- [139] Y. Wang, L. Zhang, J. Tan, M. Li, Y. Gao, X. Guerin, X. Meng, and S. Meng, "HydraDB: A Resilient RDMA-Driven Key-Value Middleware for In-Memory Cluster Computing," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '15. Austin, TX, USA: Association for Computing Machinery, Nov. 2015, pp. 22:1–22:11. [Online]. Available: https://doi.org/10.1145/2807591.2807614
- [140] A. Kalia, M. Kaminsky, and D. G. Andersen, "Using RDMA Efficiently for Key-Value Services," SIGCOMM Comput. Commun. Rev., vol. 44, no. 4, pp. 295–306, Aug. 2014. [Online]. Available: https://doi.org/10.1145/2740070.2626299
- [141] A. K. Simpson, A. Szekeres, J. Nelson, and I. Zhang, "Securing RDMA for High-Performance Datacenter Storage Systems," in Proceedings of the 12th USENIX Workshop on Hot Topics in Cloud Computing, ser. HotCloud '20. Virtual Event: USENIX Association, Jul. 2020. [Online]. Available: https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson

- [142] J. Huang, X. Ouyang, J. Jose, M. Wasi-ur Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda, "High-Performance Design of HBase with RDMA over Infiniband," in Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, ser. IPDPS '12. Shanghai, China: IEEE Press, May 2012, pp. 774–785. [Online]. Available: https://ieeexplore.ieee.org/document/6267886
- [143] N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, "High Performance RDMA-Based Design of HDFS over InfiniBand," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Salt Lake City, UT, USA: IEEE Press, Nov. 2012, pp. 1–12. [Online]. Available: https://ieeexplore.ieee.org/document/6468497
- [144] X. Lu, N. S. Islam, M. Wasi-Ur-Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, "High-Performance Design of Hadoop RPC with RDMA over InfiniBand," in Proceedings of the 42nd International Conference on Parallel Processing, ser. ICPP '13. Lyon, France: IEEE Press, Oct. 2013, pp. 641–650. [Online]. Available: https://ieeexplore.ieee.org/document/6687402
- [145] T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, "High Performance RDMA Protocols in HPC," in Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI '06), ser. Lecture Notes in Computer Science (LNCS), B. Mohr, J. L. Träff, J. Worringen, and J. Dongarra, Eds. Bonn, Germany: Springer Berlin Heidelberg, Sep. 2006, vol. 4192, pp. 76–85. [Online]. Available: https://link.springer.com/chapter/10.1007/11846802\_18
- [146] M. Poke and T. Hoefler, "DARE: High-Performance State Machine Replication on RDMA Networks," in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '15. Portland, OR, USA: Association for Computing Machinery, Jun. 2015, pp. 107–118. [Online]. Available: https://doi.org/10.1145/2749246.2749267
- [147] J. Liu, J. Wu, and D. K. Panda, "High Performance RDMA-Based MPI Implementation over InfiniBand," International Journal of Parallel Programming, vol. 32, no. 3, pp. 167–198, Jun. 2004. [Online]. Available: https://link.springer.com/article/10.1023/B:IJPP.0000029272.69895.c1
- [148] A. Kalia, M. Kaminsky, and D. G. Andersen, "Design Guidelines for High Performance RDMA Systems," in Proceedings of the 2016 USENIX Annual Technical Conference, ser. USENIX ATC '16. Denver, CO, USA: USENIX Association, Jun. 2016, pp. 437–450. [Online]. Available: https://www.usenix.org/conference/atc16/technical-sessions/presentation/kalia
- [149] M. Besta and T. Hoefler, "Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations," in Proceedings of the 29th ACM on International Conference on Supercomputing, ser. ICS '15. Newport Beach, CA, USA: Association for Computing Machinery, Jun. 2015, pp. 155–164. [Online]. Available: https://doi.org/10.1145/2751205.2751219
- [150] —, "Fault Tolerance for Remote Memory Access Programming Models," in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '14. Vancouver, BC, Canada: Association for Computing Machinery, Jun. 2014, pp. 37–48. [Online]. Available: https://doi.org/10.1145/2600212.2600224
- [151] P. Schmid, M. Besta, and T. Hoefler, "High-Performance Distributed RMA Locks," in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '16. Kyoto, Japan: Association for Computing Machinery, Jun. 2016, pp. 19–30. [Online]. Available: https://doi.org/10.1145/2907294.2907323
- [152] M. Besta, J. Domke, M. Schneider, M. Konieczny, S. Di Girolamo, T. Schneider, A. Singla, and T. Hoefler, "High-Performance Routing with Multipathing and Path Diversity in Ethernet and HPC Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 4, pp. 943–959, Apr. 2021. [Online]. Available: https://doi.org/10.1109/TPDS.2020.3035761
- [153] M. Besta, M. Schneider, M. Konieczny, K. Cynk, E. Henriksson, S. Di Girolamo, A. Singla, and T. Hoefler, "FatPaths: Routing in Supercomputers and Data Centers When Shortest Paths Fall Short," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. Atlanta, GA, USA: IEEE Press, Nov. 2020, pp. 27:1–27:18. [Online]. Available: https://doi.org/10.1109/SC41405.2020.00031

- [154] M. Burke, S. Dharanipragada, S. Joyner, A. Szekeres, J. Nelson, I. Zhang, and D. R. K. Ports, "PRISM: Rethinking the RDMA Interface for Distributed Systems," in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, ser. SOSP '21. Virtual Event, Germany: Association for Computing Machinery, Oct. 2021, pp. 228–242. [Online]. Available: https://doi.org/10.1145/3477132.3483587
- [155] S. Jha, J. Behrens, T. Gkountouvas, M. Milano, W. Song, E. Tremel, R. V. Renesse, S. Zink, and K. P. Birman, "Derecho: Fast State Machine Replication for Cloud Services," ACM Transactions on Computer Systems, vol. 36, no. 2, pp. 4:1–4:49, Apr. 2019. [Online]. Available: https://doi.org/10.1145/3302258
- [156] D. Kim, A. Memaripour, A. Badam, Y. Zhu, H. H. Liu, J. Padhye, S. Raindel, S. Swanson, V. Sekar, and S. Seshan, "Hyperloop: Group-Based NIC-Offloading to Accelerate Replicated Transactions in Multi-Tenant Storage Systems," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, ser. SIGCOMM '18. Budapest, Hungary: Association for Computing Machinery, Aug. 2018, pp. 297–312. [Online]. Available: https://doi.org/10.1145/3230543.3230572
- [157] Y. Taleb, R. Stutsman, G. Antoniu, and T. Cortes, "Tailwind: Fast and Atomic RDMA-based Replication," in Proceedings of the 2018 USENIX Annual Technical Conference, ser. USENIX ATC '18. Boston, MA, USA: USENIX Association, Jul. 2018, pp. 851–863. [Online]. Available: https://www.usenix.org/conference/atc18/presentation/taleb
- [158] E. Zamanian, X. Yu, M. Stonebraker, and T. Kraska, "Rethinking Database High Availability with RDMA Networks," Proceedings of the VLDB Endowment, vol. 12, no. 11, pp. 1637–1650, Jul. 2019. [Online]. Available: https://doi.org/10.14778/3342263.3342639
- [159] X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen, "Fast In-Memory Transaction Processing Using RDMA and HTM," in Proceedings of the 25th Symposium on Operating Systems Principles, ser. SOSP '15. Monterey, CA, USA: Association for Computing Machinery, Oct. 2015, pp. 87–104. [Online]. Available: https://doi.org/10.1145/2815400.2815419
- [160] A. Dragojević, D. Narayanan, M. Castro, and O. Hodson, "FaRM: Fast Remote Memory," in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation, ser. NSDI '14. Seattle, WA, USA: USENIX Association, Apr. 2014, pp. 401–414. [Online]. Available: https://www.usenix.org/conference/nsdi14/technical-sessions/dragojevi%C4%87
- [161] X. Wei, Z. Dong, R. Chen, and H. Chen, "Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!" in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI '18. Carlsbad, CA, USA: USENIX Association, Oct. 2018, pp. 233–251. [Online]. Available: https://www.usenix.org/conference/osdi18/presentation/wei
- [162] E. Zamanian, C. Binnig, T. Harris, and T. Kraska, "The End of a Myth: Distributed Transactions Can Scale," Proc. VLDB Endow., vol. 10, no. 6, pp. 685–696, Feb. 2017. [Online]. Available: https://doi.org/10.14778/3055330.3055335
- [163] T. Ziegler, S. Tumkur Vani, C. Binnig, R. Fonseca, and T. Kraska, "Designing Distributed Tree-Based Index Structures for Fast RDMA-Capable Networks," in Proceedings of the 2019 International Conference on Management of Data, ser. SIGMOD '19. Amsterdam, Netherlands: Association for Computing Machinery, Jun. 2019, pp. 741–758. [Online]. Available: https://doi.org/10.1145/3299869.3300081
- [164] S. Di Girolamo, D. De Sensi, K. Taranov, M. Malesevic, M. Besta, T. Schneider, S. Kistler, and T. Hoefler, "Building Blocks for Network-Accelerated Distributed File Systems," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '22. Dallas, TX, USA: IEEE Press, Nov. 2022, pp. 10:1–10:14. [Online]. Available: https://doi.org/10.1109/SC41404.2022.00015
- [165] C. Binnig, A. Crotty, A. Galakatos, T. Kraska, and E. Zamanian, "The End of Slow Networks: It's Time for a Redesign," Proc. VLDB Endow., vol. 9, no. 7, pp. 528–539, Mar. 2016. [Online]. Available: https://doi.org/10.14778/2904483.2904485
- [166] W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann, "High-Speed Query Processing over High-Speed Networks," Proc. VLDB Endow., vol. 9, no. 4, pp. 228–239, Dec. 2015. [Online]. Available: https://doi.org/10.14778/2856318.2856319
- [167] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek, M. Besta, L. Benini, D. Roweth, and T. Hoefler, "Network-Accelerated Non-Contiguous Memory Transfers," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. Denver, CO, USA: Association for Computing Machinery, Nov. 2019, pp. 56:1–56:14. [Online]. Available: https://doi.org/10.1145/3295500.3356189

- [168] C. Barthels, S. Loesing, G. Alonso, and D. Kossmann, "Rack-Scale In-Memory Join Processing Using RDMA," in Proceedings of the 2015 International Conference on Management of Data, ser. SIGMOD '15. Melbourne, Victoria, Australia: Association for Computing Machinery, May 2015, pp. 1463–1475. [Online]. Available: https://doi.org/10.1145/2723372.2750547
- [169] F. Li, S. Das, M. Syamala, and V. R. Narasayya, "Accelerating Relational Databases by Leveraging Remote Memory and RDMA," in Proceedings of the 2016 International Conference on Management of Data, ser. SIGMOD '16. San Francisco, CA, USA: Association for Computing Machinery, Jun. 2016, pp. 355–370. [Online]. Available: https://doi.org/10.1145/2882903.2882949
- [170] M. Besta, R. Gerstenberger, N. Blach, M. Fischer, and T. Hoefler, "GDI: A Graph Database Interface Standard," ETH Zurich, Tech. Rep., Nov. 2023, accessed: May 1, 2024. [Online]. Available: https://github.com/spcl/GDI-RMA
- [171] M. Besta, R. Gerstenberger, M. Fischer, M. Podstawski, N. Blach, B. Egeli, G. Mitenkov, W. Chlapek, M. Michalewicz, H. Niewiadomski, J. Müller, and T. Hoefler, "The Graph Database Interface: Scaling Online Transactional and Analytical Graph Workloads to Hundreds of Thousands of Cores," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '23. Denver, CO, USA: Association for Computing Machinery, Nov. 2023, pp. 22:1–22:18. [Online]. Available: https://doi.org/10.1145/3581784.3607068
- [172] A. Strausz, F. Vella, S. Di Girolamo, M. Besta, and T. Hoefler, "Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching," Mar. 2022, arXiv:2202.13976. [Online]. Available: https://arxiv.org/abs/2202.13976
- [173] M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch, T. Hoefler, and E. Solomonik, "Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons," in Proceedings of the IEEE 34th International Parallel and Distributed Processing Symposium, ser. IPDPS '20, New Orleans, LA, USA, May 2020, pp. 1122–1132. [Online]. Available: https://doi.org/10.1109/IPDPS47924.2020.00118
- [174] E. Solomonik, M. Besta, F. Vella, and T. Hoefler, "Scaling Betweenness Centrality Using Communication-Efficient Sparse Matrix Multiplication," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. Denver, CO, USA: Association for Computing Machinery, Nov. 2017, pp. 47:1–47:14. [Online]. Available: https://doi.org/10.1145/3126908.3126971
- [175] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler, "To Push or to Pull: On Reducing Communication and Synchronization in Graph Computations," in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '17. Washington, DC, USA: Association for Computing Machinery, Jun. 2017, pp. 93–104. [Online]. Available: https://doi.org/10.1145/3078597.3078616
- [176] M. Besta and T. Hoefler, "Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages," in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '15. Portland, OR, USA: Association for Computing Machinery, Jun. 2015, pp. 161–172. [Online]. Available: https://doi.org/10.1145/2749246.2749263
- [177] MPI Forum, "MPI: A Message-Passing Interface Standard. Version 3," Sep. 2012. [Online]. Available: http://www.mpi-forum.org
- [178] H. Schweizer, M. Besta, and T. Hoefler, "Evaluating the Cost of Atomic Operations on Modern Architectures," in Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, ser. PACT '15. San Francisco, CA, USA: IEEE Press, Oct. 2015, pp. 445–456. [Online]. Available: https://doi.org/10.1109/PACT.2015.24
- [179] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. Morgan Kaufmann, 2008.
- [180] K. Maschhoff, R. Vesse, and J. Maltby, "Porting the Urika-GD Graph Analytic Database to the XC30/40 Platform," in Proceedings of the Cray User Group Conference, ser. CUG '15, Chicago, IL, USA, Apr. 2015. [Online]. Available: https://cug.org/proceedings/cug2015\_proceedings/includes/files/pap150.pdf

- [181] C. D. Rickett, U.-U. Haus, J. Maltby, and K. J. Maschhoff, "Loading and Querying a Trillion RDF Triples with Cray Graph Engine on the Cray XC," in Proceedings of the Cray User Group Conference, ser. CUG '18, Stockholm, Sweden, May 2018.
- [182] M. ten Bruggencate and D. Roweth, "DMAPP An API for One-Sided Program Models on Baker Systems," in Proceedings of the Cray User Group Conference, ser. CUG '10, Edinburgh, Scotland, May 2010, pp. 1–7. [Online]. Available: https://cug.org/5-publications/proceedings\_attendee\_lists/CUG10CD/pages/1-program/final\_program/CUG10\_Proceedings/pages/authors/01-5Monday/03B-tenBruggencate-Paper-2.pdf
- [183] K. J. Maschhoff, R. Vesse, S. R. Sukumar, M. F. Ringenburg, and J. Maltby, "Quantifying Performance of CGE: A Unified Scalable Pattern Mining and Search System," in Proceedings of the Cray User Group Conference, ser. CUG '17, Redmond, WA, USA, May 2017, pp. 1–14. [Online]. Available: https://cug.org/proceedings/cug2017\_proceedings/includes/files/pap179s2-file1.pdf
- [184] C. D. Rickett, K. J. Maschhoff, and S. R. Sukumar, "Optimizing the Cray Graph Engine for Performant Analytics on Cluster, SuperDome Flex, Shasta Systems and Cloud Deployment," in Proceedings of the Cray User Group Conference, ser. CUG '21, Virtual Event, May 2021, pp. 1–9. [Online]. Available: https://cug.org/proceedings/cug2021\_proceedings/includes/files/pap114s2-file1.pdf
- [185] C. Buragohain, K. M. Risvik, P. Brett, M. Castro, W. Cho, J. Cowhig, N. Gloy, K. Kalyanaraman, R. Khanna, J. Pao, M. Renzelmann, A. Shamis, T. Tan, and S. Zheng, "A1: A Distributed In-Memory Graph Database," in Proceedings of the International Conference on Management of Data, ser. SIGMOD '20. Portland, OR, USA: Association for Computing Machinery, Jun. 2020, pp. 329–344. [Online]. Available: https://doi.org/10.1145/3318464.3386135
- [186] Infiniband Trade Association, "InfiniBand<sup>™</sup> Architecture Specification Release 1.2.1 Annex A17: RoCEv2," 2014.
- [187] Microsoft, "Bond: Cross-Platform Framework for Working with Schematized Data," 2016, accessed: May 1, 2024. [Online]. Available: https://github.com/microsoft/bond
- [188] J. Shi, Y. Yao, R. Chen, H. Chen, and F. Li, "Fast and Concurrent RDF Queries with RDMA-Based Distributed Graph Exploration," in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI '16. Savannah, GA, USA: USENIX Association, Nov. 2016, pp. 317–332. [Online]. Available: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/shi
- [189] Y. Zhang, R. Chen, and H. Chen, "Sub-Millisecond Stateful Stream Querying over Fast-Evolving Linked Data," in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP '17. Shanghai, China: Association for Computing Machinery, Oct. 2017, pp. 614–630. [Online]. Available: https://doi.org/10.1145/3132747.3132777
- [190] S. Wang, C. Lou, R. Chen, and H. Chen, "Fast and Concurrent RDF Queries Using RDMA-Assisted GPU Graph Exploration," in Proceedings of the 2018 USENIX Annual Technical Conference, ser. USENIX ATC '18. Boston, MA, USA: USENIX Association, Jul. 2018, pp. 651–664. [Online]. Available: https://www.usenix.org/conference/atc18/presentation/wang-siyuan
- [191] Z. Yao, R. Chen, B. Zang, and H. Chen, "Wukong+ G: Fast and Concurrent RDF Query Processing Using RDMA-Assisted GPU Graph Exploration," IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 7, pp. 1619–1635, Jul. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9582823
- [192] B. Huang, L. Jin, Z. Lu, M. Yan, J. Wu, P. C. Hung, and Q. Tang, "RDMA-Driven MongoDB: An Approach of RDMA Enhanced NoSQL Paradigm for Large-Scale Data Processing," Information Sciences, vol. 502, pp. 376–393, Oct. 2019. [Online]. Available: https: //www.sciencedirect.com/science/article/abs/pii/S0020025519305869
- [193] B. Cassell, T. Szepesi, B. Wong, T. Brecht, J. Ma, and X. Liu, "Nessie: A Decoupled, Client-Driven Key-Value Store Using RDMA," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 12, pp. 3537–3552, Dec. 2017. [Online]. Available: https: //ieeexplore.ieee.org/document/7987083
- [194] M. Yang, S. Yu, R. Yu, N. Xiao, F. Liu, and W. Chen, "InnerCache: A Tactful Cache Mechanism for RDMA-Based Key-Value Store," in Proceedings of the 2016 IEEE International Conference on Web Services, ser. ICWS '16. San Francisco, CA, USA: IEEE Press, Jun. 2016, pp. 646–649. [Online]. Available: https://ieeexplore.ieee.org/ document/7558060

- [195] N. S. Islam, D. Shankar, X. Lu, M. Wasi-Ur-Rahman, and D. K. Panda, "Accelerating I/O Performance of Big Data Analytics on HPC Clusters Through RDMA-Based Key-Value Store," in Proceedings of the 44th International Conference on Parallel Processing, ser. ICPP '15. Beijing, China: IEEE Press, Sep. 2015, pp. 280–289. [Online]. Available: https://ieeexplore.ieee.org/document/7349583
- [196] D. Shankar, X. Lu, N. Islam, M. Wasi-Ur-Rahman, and D. K. Panda, "High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-Blocking Extensions, Designs, and Benefits," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium, ser. IPDPS '16. Chicago, IL, USA: IEEE Press, May 2016, pp. 393–402. [Online]. Available: https://ieeexplore.ieee.org/document/7516035
- [197] T. Ben-Nun, J. de Fine Licht, A. N. Ziogas, T. Schneider, and T. Hoefler, "Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. Denver, CO, USA: Association for Computing Machinery, Nov. 2019, pp. 81:1–81:14. [Online]. Available: https://doi.org/10.1145/3295500.3356173
- [198] T. Ben-Nun, B. Ates, A. Calotoiu, and T. Hoefler, "Bridging Control-Centric and Data-Centric Optimization," in Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, ser. CGO '23. Association for Computing Machinery, Feb. 2023, pp. 173–185. [Online]. Available: https://doi.org/10.1145/3579990.3580018
- [199] O. Rausch, T. Ben-Nun, N. Dryden, A. Ivanov, S. Li, and T. Hoefler, "A Data-Centric Optimization Framework for Machine Learning," in Proceedings of the 36th ACM International Conference on Supercomputing, ser. ICS '22. Virtual Event: Association for Computing Machinery, Jun. 2022, pp. 36:1–36:13. [Online]. Available: https://doi.org/10.1145/3524059.3532364
- [200] G. Kwaśniewski, M. Kabic, T. Ben-Nun, A. N. Ziogas, J. E. Saethre, A. Gaillard, T. Schneider, M. Besta, A. Kozhevnikov, J. VandeVondele, and T. Hoefler, "On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '21. St. Louis, MO, USA: Association for Computing Machinery, Nov. 2021, pp. 70:1–70:15. [Online]. Available: https://doi.org/10.1145/3458817.3476167
- [201] J. Bazinska, A. Ivanov, T. Ben-Nun, N. Dryden, M. Besta, S. Shen, and T. Hoefler, "Cached Operator Reordering: A Unified View for Fast GNN Training," Aug. 2023, arXiv:2308.12093. [Online]. Available: https://arxiv.org/abs/2308.12093
- [202] T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, and R. Brightwell, "sPIN: High-performance streaming Processing In the Network," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. Denver, CO, USA: Association for Computing Machinery, Nov. 2017, pp. 59:1–59:16. [Online]. Available: https://doi.org/10.1145/3126908.3126970
- [203] T. Li, J. Hou, J. Yan, R. Liu, H. Yang, and Z. Sun, "Chiplet Heterogeneous Integration Technology—Status and Challenges," Electronics, vol. 9, no. 4, pp. 670:1–670:12, 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/4/670
- [204] J. H. Lau, "Chiplet Heterogeneous Integration," in Semiconductor Advanced Packaging. Springer Singapore, May 2021, pp. 413–439. [Online]. Available: https://link.springer.com/chapter/10.1007/978-981-16-1376-0\_9
- [205] G. Bell, D. H. Bailey, J. Dongarra, A. H. Karp, and K. Walsh, "A Look Back on 30 Years of the Gordon Bell Prize," The International Journal of High Performance Computing Applications, vol. 31, no. 6, pp. 469–484, Nov. 2017. [Online]. Available: https://journals.sagepub.com/doi/abs/10.1177/1094342017738610
- [206] R. Kannan, P. Sao, H. Lu, J. Kurzak, G. Schenk, Y. Shi, S. Lim, S. Israni, V. Thakkar, G. Cong, R. Patton, S. E. Baranzini, R. Vuduc, and T. Potok, "Exaflops Biomedical Knowledge Graph Analytics," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '22. Dallas, TX, USA: IEEE Press, Nov. 2022, pp. 6:1–6:11. [Online]. Available: https://ieeexplore.ieee.org/document/10046083
- [207] G. H. Loh, M. J. Schulte, M. Ignatowski, V. Adhinarayanan, S. Aga, D. Aguren, V. Agrawal, A. M. Aji, J. Alsop, P. Bauman, B. M. Beckmann, M. V. Beigi, S. Blagodurov, T. Boraten, M. Boyer, W. C. Brantley, N. Chalmers, S. Chen, K. Cheng, M. L. Chu, D. Cownie, N. Curtis, J. Del Pino, N. Duong, A. Duțu, Y. Eckert, C. Erb, C. Freitag, J. L. Greathouse, S. Gurumurthi, A. Gutierrez, K. Hamidouche, S. Hossamani, W. Huang, M. Islam, N. Jayasena, J. Kalamatianos, O. Kayiran, J. Kotra, A. Lee, D. Lowell, N. Madan, A. Majumdar, N. Malaya, S. Manne, S. Mashimo, D. McDougall, E. Mednick, M. Mishkin, M. Nutter, I. Paul, M. Poremba, B. Potter, K. Punniyamurthy, S. Puthoor, S. E. Raasch, K. Rao, G. Rodgers, M. Scrbak, M. Seyedzadeh, J. Slice, V. Sridharan, R. van Oostrum, E. van Tassell, A. Vishnu, S. Wasmundt, M. Wilkening, N. Wolfe, M. Wyse, A. Yalavarti, and D. Yudanov, "A Research Retrospective on AMD's Exascale Computing Journey," in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA '23. Orlando, FL, USA: Association for Computing Machinery, Jun. 2023, pp. 81:1–81:14. [Online]. Available: https://doi.org/10.1145/3579371.3589349

- [208] S. Naffziger, N. Beck, T. Burd, K. Lepak, G. H. Loh, M. Subramony, and S. White, "Pioneering Chiplet Technology and Design for the AMD EPYC<sup>™</sup> and Ryzen<sup>™</sup> Processor Families," in Proceedings of the 48th Annual International Symposium on Computer Architecture, ser. ISCA '21. Virtual Event, Spain: IEEE Press, Jun. 2021, pp. 57–70. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00014
- [209] M. Orenes-Vera, E. Tureci, D. Wentzlaff, and M. Martonosi, "Massive Data-Centric Parallelism in the Chiplet Era," Aug. 2023, arXiv:2304.09389. [Online]. Available: https://arxiv.org/abs/2304.09389
- [210] S. Pal, J. Liu, I. Alam, N. Cebry, H. Suhail, S. Bu, S. S. Iyer, S. Pamarti, R. Kumar, and P. Gupta, "Designing a 2048-Chiplet, 14336-Core Waferscale Processor," in Proceedings of the 58th ACM/IEEE Design Automation Conference, ser. DAC '21. Virtual Event: IEEE Press, Dec. 2021, pp. 1183–1188. [Online]. Available: https://ieeexplore.ieee.org/document/9586194
- [211] A. Narayan, Y. Thonnart, P. Vivet, C. F. Tortolero, and A. K. Coskun, "WAVES: Wavelength Selection for Power-Efficient 2.5D-Integrated Photonic NoCs," in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, ser. DATE '19. Grenoble, France: IEEE Press, Mar. 2019, pp. 516–521. [Online]. Available: https://ieeexplore.ieee.org/document/8715036
- [212] A. Narayan, Y. Thonnart, P. Vivet, A. Joshi, and A. K. Coskun, "System-Level Evaluation of Chip-Scale Silicon Photonic Networks for Emerging Data-Intensive Applications," in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, ser. DATE '20. Grenoble, France: IEEE Press, Mar. 2020, pp. 1444–1449. [Online]. Available: https://ieeexplore.ieee.org/document/9116496
- [213] W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proceedings of the 38th Annual Design Automation Conference, ser. DAC '01. Las Vegas, NV, USA: IEEE Press, Jun. 2001, pp. 684–689. [Online]. Available: https://ieeexplore.ieee.org/document/935594
- [214] E. Salminen, A. Kulmala, and T. D. Hämäläinen, "Survey of Network-on-Chip Proposals," Tampere University of Technology, Tech. Rep., Mar. 2008.
- [215] Y. Wu, C. Lu, and Y. Chen, "A Survey of Routing Algorithm for Mesh Network-on-Chip," Frontiers of Computer Science, vol. 10, no. 4, pp. 591–601, Aug. 2016. [Online]. Available: https://link.springer.com/article/10.1007/s11704-016-5431-8
- [216] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler, "Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability," SIGPLAN Not., vol. 53, no. 2, pp. 43–55, Mar. 2018. [Online]. Available: https://doi.org/10.1145/3296957.3177158
- [217] P. Iff, M. Besta, M. Cavalcante, T. Fischer, L. Benini, and T. Hoefler, "Sparse Hamming Graph: A Customizable Network-on-Chip Topology," in Proceedings of the 2023 60th ACM/IEEE Design Automation Conference, ser. DAC '23. San Francisco, CA, USA: IEEE Press, Jul. 2023, pp. 1–6. [Online]. Available: https://doi.org/10.1109/DAC56929.2023.10247754
- [218] P. Yao, L. Zheng, Y. Huang, Q. Wang, C. Gui, Z. Zeng, X. Liao, H. Jin, and J. Xue, "ScalaGraph: A Scalable Accelerator for Massively Parallel Graph Processing," in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, ser. HPCA '22. Virtual Event, South Korea: IEEE Press, Apr. 2022, pp. 199–212. [Online]. Available: https://ieeexplore.ieee.org/document/9773208
- [219] A. Auten, M. Tomei, and R. Kumar, "Hardware Acceleration of Graph Neural Networks," in Proceedings of the 57th ACM/IEEE Design Automation Conference, ser. DAC '20. Virtual Event: IEEE Press, Jun. 2020, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/document/9218751

- [220] D. J. Watts and S. H. Strogatz, "Collective Dynamics of 'Small-World' Networks," Nature, vol. 393, no. 6684, pp. 440–442, Jun. 1998. [Online]. Available: https://www.nature.com/articles/30918
- [221] D. Choudhury, R. Barik, A. S. Rajam, A. Kalyanaraman, and P. P. Pande, "Software/Hardware Co-design of 3D NoC-based GPU Architectures for Accelerated Graph Computations," ACM Trans. Des. Autom. Electron. Syst., vol. 27, no. 6, pp. 61:1–61:22, Jun. 2022. [Online]. Available: https://doi.org/10.1145/3514354
- [222] K. Duraisamy, H. Lu, P. P. Pande, and A. Kalyanaraman, "High-Performance and Energy-Efficient Network-on-Chip Architectures for Graph Analytics," ACM Trans. Embed. Comput. Syst., vol. 15, no. 4, pp. 66:1–66:26, Sep. 2016. [Online]. Available: https://doi.org/10.1145/2961027
- [223] S. Leung, P. Fisher, and M. Shanblatt, "A Conceptual Framework for ASIC Design," Proceedings of the IEEE, vol. 76, no. 7, pp. 741–755, Jul. 1988. [Online]. Available: https://ieeexplore.ieee.org/document/7141

...