Kernel Recipes 2024 - How CERN serves 1EB of data via FUSE

Summary of the YouTube Video Transcript:

This transcript is from a presentation given at Kernel Recipes 2024 by a speaker from CERN’s IT Storage and Data Management Group. The presentation gives an overview of how CERN uses the FUSE subsystem to serve data to its user base, and stresses that the projects discussed are the result of long-term efforts by many colleagues.

The presentation begins with an introduction to CERN, highlighting its role as an international organization for nuclear and particle physics research, its member states, and its location on the Swiss-French border. CERN’s mission is to understand fundamental particles and laws of nature, achieved through three pillars: accelerators, detectors, and computing.

Accelerators: The speaker focuses on the Large Hadron Collider (LHC), an accelerator 27 km in circumference. The LHC is the final stage of a chain of accelerators: particles start in the Linac, then pass through the Proton Synchrotron Booster, the Proton Synchrotron, and the SPS before entering the LHC ring itself. Particles reach speeds close to the speed of light and circulate thousands of times per second. Beyond acceleration, CERN also has decelerators like the Antiproton Decelerator (AD) for antimatter studies.

Detectors: These are described as large, complex 3D cameras that measure particle position, charge, and momentum at a rate of 40 million times per second. The speaker details the four main LHC experiments: ATLAS, CMS, ALICE, and LHCb, noting their different focuses and designs. ATLAS and CMS are general-purpose detectors, while LHCb focuses on B-physics and ALICE on heavy-ion collisions. A simplified cross-section of the CMS detector is explained, illustrating layers like the silicon tracker, electromagnetic calorimeter, hadron calorimeter, and muon spectrometer, and how they interact with different types of particles. The immense data volume is stressed, with future LHC phases expecting data rates up to 51 terabits per second.

Data Flow and EOS: The presentation outlines the data flow from the ALICE detector as an example. Initial data rates are reduced by triggers, processed by GPUs and compute nodes to select interesting events, and then sent to EOS (EOS Open Storage), a distributed file system at CERN. EOS holds petabytes of data and feeds into tape archives for long-term storage. FUSE plays a crucial role in this ecosystem: it lets users access data in EOS for physics analysis and batch processing as if it were a local POSIX file system.
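
Because a FUSE mount presents EOS as an ordinary POSIX tree, analysis code needs nothing beyond standard file calls. A minimal sketch of what that means in practice; the helper names are illustrative and the `/eos/...` path in the final comment is hypothetical, not taken from the talk:

```python
import os


def read_header(path: str, nbytes: int = 16) -> bytes:
    """Read the first nbytes of a file using plain POSIX-style calls."""
    with open(path, "rb") as f:
        return f.read(nbytes)


def list_sizes(directory: str) -> dict[str, int]:
    """Map each regular file in a directory to its size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }


# Hypothetical usage on a machine where EOS is mounted through FUSE:
# header = read_header("/eos/some/experiment/dataset/file.root")
```

The same functions work unchanged on a local disk, which is the point: the storage backend is invisible to the caller.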

EOS Infrastructure: Hardware details of the ALICE O2 EOS instance are provided, including its 180 petabytes of raw space (150 petabytes usable with 10+2 erasure coding), 12,000 hard drives, and server configurations. Performance benchmarks are mentioned, reaching 380 GB/s write and 1.3 TB/s read speeds. Real-world peak data rates observed in monitoring are shown, highlighting sustained periods of high activity. CERN’s tape libraries and their capacity (15-60 GB/s to tape, recent record of 45+ petabytes archived) are also mentioned. The presentation notes that the data read via FUSE from EOS systems in the past year reached the exabyte scale (18% of total data read), underlining FUSE’s importance for user access and data analysis.
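
The raw and usable capacities quoted above are consistent with a 10+2 layout, in which every 10 data stripes are stored alongside 2 parity stripes. A short check of the arithmetic (the function names are illustrative):

```python
def usable_capacity(raw: float, data_stripes: int, parity_stripes: int) -> float:
    """Capacity left for user data once parity stripes are accounted for."""
    return raw * data_stripes / (data_stripes + parity_stripes)


def storage_overhead(data_stripes: int, parity_stripes: int) -> float:
    """Raw bytes written per byte of user data."""
    return (data_stripes + parity_stripes) / data_stripes


# 180 PB raw with 10+2 erasure coding
print(usable_capacity(180, 10, 2))   # 150.0 (petabytes)
print(storage_overhead(10, 2))       # 1.2
```

For comparison, keeping two full replicas of every file would leave only 90 PB usable from the same 180 PB of raw space.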

FUSE Usage and XRootD: The speaker explains how FUSE is used for batch system access, user home directories, and project spaces, making data accessible via standard POSIX paths. Authentication is handled via Kerberos. While FUSE is heavily used (especially for file operations), a significant portion of data access still goes through XRootD (XRD). XRootD is introduced as a project for scalable remote data access, initially developed at SLAC for the BaBar experiment. It is described as similar to curl, Nginx, and Varnish combined, supporting its own stateful, POSIX-like protocol with features like vector reads and writes over the network, TLS, proxy/caching capabilities, and various authentication methods (Kerberos, certificates, tokens). Crucially, XRootD is not a file system; it serves files from underlying file systems (POSIX, Ceph, Lustre) over the network. It includes clustering support (the cmsd daemon) for creating uniform namespaces and load balancing, and is widely used in high-energy physics communities such as CMS (AAA Federation), OSG, and WLCG. Beyond file access, XRootD is also used in other applications, such as the Vera Rubin Observatory’s petabyte-scale MySQL cluster. XRootD’s monitoring capabilities are described as essential for system management.
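
A vector read bundles many (offset, length) requests into a single call, so a client fetching scattered pieces of a file pays for one round trip rather than one per piece. The sketch below only illustrates the shape of such a request against a local file using `os.pread`; it is not the XRootD client API, and `vector_read` is a hypothetical helper:

```python
import os
from typing import Iterable


def vector_read(path: str, chunks: Iterable[tuple[int, int]]) -> list[bytes]:
    """Return the bytes for each (offset, length) pair, in request order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Locally this is a loop of positional reads; over a network protocol
        # the whole chunk list would travel in one request.
        return [os.pread(fd, length, offset) for offset, length in chunks]
    finally:
        os.close(fd)
```

Calling `vector_read(path, [(0, 4), (10, 3)])` returns two byte strings: the first four bytes of the file, and three bytes starting at offset 10.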

EOS Details: EOS is described as built on top of XRootD, leveraging its framework for protocol, redirection, and third-party copy. EOS was designed for cost-effectiveness, mixing hardware generations, and efficient resource sharing for thousands of users. It offers features like ACLs, quality of service (QoS) for experiments vs. users, and file placement policies. The architecture involves redundant namespace servers (using QuarkDB on RocksDB with Redis protocol) and numerous storage servers with JBODs. Multiple EOS instances are deployed at CERN, each dedicated to specific experiments or user groups. Challenges related to user behavior (e.g., directory listing overload) and protection mechanisms are discussed. File placement, rebalancing, and cleanup policies are also mentioned as key management aspects.

FUSE Client Deep Dive: The presentation delves into the FUSE client used with EOS. It originated from a simpler XRootD FUSE client and has evolved significantly for production use. It operates in an “inverted” way, acting as a client to the EOS namespace server. It uses the XRootD protocol for communication, even for FUSE-based access. Operations are broadcast to all connected clients, with eventual consistency mechanisms in place. A local journal and caching are used to reduce network traffic. Past issues with inode invalidation and kernel interactions are mentioned, along with their resolution in newer EOS versions. Vector read limitations in FUSE and workarounds using the ROOT software are discussed. The idea of shared metadata caches between kernel and user space is proposed as a potential FUSE client improvement. Despite its limitations, FUSE performance is generally sufficient, with bottlenecks more often found in the namespace server or on the client side.
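
The interplay of client-side caching and server-driven broadcasts can be pictured as a cache whose entries are dropped when the server announces a change. This is a toy model under stated assumptions, not the EOS client’s actual design: the class, its methods, and the `fetch` callback are all hypothetical.

```python
from typing import Callable, Dict


class MetadataCache:
    """Toy client-side metadata cache with explicit invalidation."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch  # stands in for a round trip to the namespace server
        self._entries: Dict[str, dict] = {}
        self.round_trips = 0

    def stat(self, path: str) -> dict:
        """Serve metadata from the cache, contacting the server only on a miss."""
        if path not in self._entries:
            self.round_trips += 1
            self._entries[path] = self._fetch(path)
        return self._entries[path]

    def invalidate(self, path: str) -> None:
        """Called when a broadcast reports that another client changed path."""
        self._entries.pop(path, None)
```

Between a remote change and the arrival of the matching invalidation, `stat` can return stale metadata; that window is what makes the model eventually consistent rather than strictly consistent.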

CVMFS for Software Distribution: The presentation briefly introduces CVMFS (CernVM File System), another FUSE-based file system used at CERN but developed by a different department. CVMFS is read-only and HTTP-based, designed for distributing experiment software across the grid. It uses a CDN architecture and content-addressed storage for deduplication, and is increasingly used for container image distribution. The speaker highlights the combination of CVMFS with Gentoo Prefix for HPC software distribution, referencing Compute Canada and EasyBuild’s adoption of this approach.
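
Content-addressed storage names each object by a hash of its bytes, so identical files collapse into a single stored copy. A minimal in-memory sketch of the idea; the class is hypothetical, and the choice of SHA-256 is only for illustration and says nothing about how CVMFS itself is configured:

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: paths map to digests, digests to bytes."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}  # digest -> content
        self._catalog: dict[str, str] = {}    # path -> digest

    def put(self, path: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(digest, data)  # identical content stored once
        self._catalog[path] = digest
        return digest

    def get(self, path: str) -> bytes:
        return self._objects[self._catalog[path]]

    def object_count(self) -> int:
        return len(self._objects)
```

Publishing the same library under two release directories adds two catalog entries but only one object, which is why this layout suits software trees with many near-identical versions.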

Storage Landscape and Open Source: Finally, a comprehensive overview of CERN’s storage landscape is presented, showing XRootD at the base, EOS built upon it, and various services like CERNBox, SWAN, CERN Open Data, and middleware like Rucio and FTS utilizing EOS. Ceph and AFS are also mentioned as part of the broader storage infrastructure. The presentation concludes by emphasizing that all mentioned projects are open source and encourages viewers to explore them on GitHub and GitLab, referencing recent project events and highlighting real-world usage examples, such as Jump Trading using CVMFS for data distribution. The presentation ends with a Q&A session covering network details, node placement, file access statistics, disk failure rates, and kernel upstreaming.

Accuracy of Information:

The information presented in the transcript appears to be largely accurate based on established knowledge about CERN, its infrastructure, and the technologies mentioned. Here’s a point-by-point check:

  • CERN and LHC: The description of CERN’s mission, member states, location, and the LHC’s role in particle physics research is consistent with publicly available information from CERN’s official website and other reputable sources. The LHC’s circumference (27km), particle speeds close to the speed of light, and collision rates are also accurate.
  • Detectors (ATLAS, CMS, ALICE, LHCb): The descriptions of the four main detectors, their general purpose or specialized nature, and the basic principles of particle detection (tracker, calorimeters, muon spectrometers) are accurate and align with descriptions found in CERN publications and experiment websites. The mention of the silicon tracker being similar to camera chips and the explanation of particle interactions within the detector are simplified but conceptually correct.
  • Data Rates and Storage: The data rates mentioned (terabits/second, petabytes, exabytes) are in the expected range for LHC experiments. The scale of data and the challenges of handling it are accurately portrayed. The EOS capacity figures and performance benchmarks seem plausible, although detailed specifications would require further verification from CERN’s IT documentation.
  • EOS, XRootD, CVMFS: The descriptions of EOS as a distributed file system built on XRootD, XRootD as a remote data access protocol, and CVMFS as a software distribution system are accurate and consistent with their project documentation and general understanding in the scientific computing community. The features and functionalities described (erasure coding, vector reads, caching, clustering, content-addressed storage) are all known characteristics of these systems. The explanation of XRootD not being a file system itself but serving data from underlying file systems is a key and accurate distinction.
  • FUSE Usage: The explanation of FUSE’s role in providing POSIX access to EOS for users and batch jobs is accurate and reflects a common use case for FUSE in large-scale storage systems. The discussion about Kerberos authentication, eventual consistency, and performance considerations is also realistic.
  • CVMFS and HPC Software Distribution: The use of CVMFS for software distribution, especially in HPC and grid computing, is a well-established and accurate application. The connection to Gentoo Prefix and EasyBuild for HPC environment management is also a known and valid development.

Potential Minor Points for Further Verification (though unlikely to be inaccurate in the context of the presentation):

  • Specific Performance Benchmarks: The exact numbers for write/read speeds (380 GB/s, 1.3 TB/s) and tape speeds (15-60 GB/s) would require checking against specific CERN IT reports if precise accuracy is needed for a formal report. However, for a presentation overview, these are likely representative and reasonable figures.
  • Disk Failure Rates: The 0.1% disk failure rate mentioned is a general estimate. Actual failure rates can vary depending on hardware, usage patterns, and environmental factors. It’s more important to note the general order of magnitude and the continuous need for disk replacement, which is accurately conveyed.
  • Client Count (30,000): The number of active FUSE clients (30,000) is large and likely fluctuates. It is probably an approximate figure representing the scale of user access at CERN, which is accurately portrayed as very large.

Overall Accuracy Assessment: The transcript presents a highly accurate overview of CERN’s data management infrastructure, the role of FUSE, and the key technologies employed. The information is consistent with public knowledge and technical documentation of the systems discussed. No significant inaccuracies were detected. For a presentation of this type, the level of detail and accuracy is excellent.

Top 5 Most Relevant Resources:

Here are 5 resources to learn more about the subjects presented in the transcript:

  1. CERN Official Website (home.cern): This is the primary resource for understanding CERN, its research, the LHC, experiments, and computing infrastructure. It offers a wealth of information, from introductory materials for the public to detailed technical documentation and news.

    • Relevance: Provides foundational knowledge about CERN, the context for the technologies discussed, and access to official information about the experiments and data challenges.
  2. EOS (EOS Open Storage) Project Website & Documentation (search for “CERN EOS”): Searching for “CERN EOS” will lead to official CERN documentation and potentially project-specific websites or GitHub repositories if they are publicly available.

    • Relevance: Offers in-depth information about EOS, its architecture, features, usage at CERN, and potentially technical details about its implementation on top of XRootD.
  3. XRootD Project Website & GitHub Repository (xrootd.org & GitHub search for “xrootd”): The official XRootD website provides documentation, tutorials, and information about the project. The GitHub repository (likely under the “xrootd” organization) contains the source code, issue tracker, and potentially further technical details.

    • Relevance: Allows for deep dives into XRootD’s protocol, architecture, features like vector reads and clustering, plugin system, and its broader use in scientific data access beyond CERN.
  4. CVMFS (CernVM File System) Project Website & Documentation (cvmfs.cern.ch): The official CVMFS website provides comprehensive documentation, tutorials, and information about CVMFS, its architecture, and usage.

    • Relevance: Provides a detailed understanding of CVMFS, its nature as a read-only FUSE file system, HTTP-based distribution, content addressing, and applications in software distribution for scientific computing and HPC.
  5. “Distributed Data Management at the Large Hadron Collider” (Research Papers/Review Articles): Searching on academic databases (like Google Scholar, INSPIRE-HEP, arXiv) for review articles or research papers with keywords like “LHC data management,” “CERN storage,” “distributed file systems high energy physics” will lead to academic literature providing a broader and potentially more in-depth perspective on the challenges and solutions for data management in large-scale scientific collaborations like those at CERN.

    • Relevance: Offers a wider academic context, potentially comparing different approaches to data management, discussing performance evaluations, and highlighting future trends in the field. This provides a more scholarly perspective than just project websites.

These resources together offer a comprehensive pathway to learn more about the topics covered in the transcript, ranging from the high-level context of CERN’s research to the technical details of the storage and data access technologies employed.
