Wei Zhang
Computer Science Researcher at LBNL
zhangwei217245 [at] lbl [dot] gov
Work Experience
Computer Science Researcher
Lawrence Berkeley National Laboratory
Mar 2023 - Now
  • I/O Optimization for HydraGNN Ensemble Training

    • Engineered a Rust-based object-centric data store to replace ADIOS for materials science data, providing unified NDArray storage that holds multiple NDArrays per data object and improving overall I/O throughput by 54% in PyTorch-based materials science ensemble training.

    • Designed and implemented a DataLoader coordinator around the PyTorch Sampler to resolve I/O contention, achieving a 48% reduction in I/O wait time during concurrent ensemble GNN training.

    • Built an elastic scaling framework that automated DDP/NCCL port management for 512+ GPU clusters across 64-1024 node configurations.

    Keywords: Rust High-perf Storage  |  PyTorch Ecosystem Integration  |  HPC-AI Convergence  |  MPI/gRPC Hybrid Comm  |  Exascale Data Layouts
  • IDIOMS Distributed Metadata Indexing and Querying Engine (CCGrid’24 First Author)

    • Designed IDIOMS, an index-powered, distributed metadata search system that enables high-performance affix-based metadata querying in object-centric data management (ODM) systems.

    • Engineered double-layered trie-based distributed index, supporting prefix, suffix, infix, and exact metadata searches, achieving 407× faster independent queries and 300× faster collective queries than SQLite-based alternatives.

    • Integrated DART (Distributed Adaptive Radix Tree) for deterministic query routing, reducing query communication overhead by 92%, enabling scalable metadata indexing across HPC clusters.

    • Validated IDIOMS on NERSC’s Perlmutter Supercomputer across 128 nodes and 16k CPU cores using 1M+ objects, 10M+ metadata attributes, demonstrating at least 370× better performance than state-of-the-art methods with ≤52.57% memory overhead.

    Keywords: Distributed Trie-Based Indexing  |  High-Performance Metadata Search  |  Affix-Oriented Query Optimization  |  Object-Centric Data Management (ODM)  |  Scalable Distributed Data Retrieval
  • BULKI - Binary Unified Layout for Key-value Interchange

    • Designed BULKI, a next-generation binary data format for scientific workflows, achieving 50% smaller serialized size and 100× faster parsing compared to MessagePack in metadata-intensive scenarios (1,000+ attributes).

    • Nested Data Structures - Support for recursive embedding of scalar values, arrays, and nested entities

    • Self-Describing Metadata - VLE encoded metadata ensures machine-readable parsing without schema predefinition

    • Compact Storage - Optimized binary layout reduces overhead by 37% for multi-dimensional astronomy datasets

    Keywords: Schema-less Binary Data Formats  |  Variable-Length Encoding  |  HPC I/O Optimization  |  Scientific Data Management
Visiting Researcher
The Ohio State University
Mar 2023 - Now
  • AI-powered Data Discovery for Cray Graph Engine (Collaborative Research with Hewlett Packard Enterprise (HPE))

    • Led a cross-institutional team across LBNL, OSU, and HPE to optimize AI inference workflows in HPE’s Cray Graph Engine (CGE), mentoring the Ph.D. student in Dr. Suren Byna’s group and achieving 63% lower query latency for scientific datasets through:

    • Dynamic UDF Filtering - Optimized CV inference workflow in both animal taxonomy and facial recognition use cases, enabling real-time feature extraction for 90k+ OpenImages datasets;

    • Three-Stage Caching Framework - Designed object/target/feature caching strategies, reducing redundant AI computations by 82% across 128-node clusters on Perlmutter;

    • Feature caching reduced AI inference query time from 74.1s → 0.42s (176× speedup);

    • Data ingestion overhead limited to less than 5%;

    • Paper published at IEEE BigData 2024 (acceptance rate - 19.7%)

    Keywords: Technical Leadership and Mentorship  |  AI/ML Inference Reusability  |  Computer Vision  |  PyTorch Integration
Adjunct Research Scientist
Texas Tech University
June 2021 - Now
  • Semantic Query over Scientific Datasets

    • Mentored the Ph.D. student in Dr. Yong Chen’s group in exploring the intersection of natural language processing, semantic query, LLM, RAG, and scientific data discovery. Deliverables include:

    • Kv2vec - LSTM-based vector embedding for key-value pairs in scientific metadata, reducing semantic search errors from 17.3% → 3.1% (by 80%) vs. traditional methods (published in IEEE HPEC’22).

    • PSQS - a parallel semantic metadata querying service over self-describing data formats, achieving 20% improvement in query hit rate and 15% higher recall (published in IEEE BigData 2023).

    • ICEAGE - an LLM-powered metadata search engine that outperforms traditional keyword-based and code-generation methods, achieving 98% query accuracy and 5.43× higher throughput in CPU-based environments and 29.52× in GPU-accelerated settings for scientific data retrieval (under review).

    Keywords: Technical Leadership and Mentorship  |  NLP and Embeddings  |  Parallel Semantic Search  |  LLM and RAG
Senior Member of Technical Staff
Oracle Corporation
July 2021 - Feb 2023
  • Metastore Integration with OCI Data Catalog

    • Owned and led the integration of OCI Data Catalog (DCAT) into OCI Big Data Service, ensuring seamless metadata federation across 12+ components (HDFS, Spark, Hive, etc.).

    • Designed and implemented automated DCAT enablement & disablement, ensuring zero-touch configuration for customers deploying big data clusters on OCI cloud.

    • Engineered a distributed metadata orchestration framework, reducing sync latency by 67% and enhancing data lineage tracking across 50+ OCI regions.

    • Developed a region-aware service provisioning mechanism, enabling 5× faster onboarding of new OCI regions, reducing manual operational overhead by 80%.

    Keywords: Automated Metadata Federation  |  Cloud Data Lake Governance  |  Multi-Region Service Orchestration
  • Security & Identity Management for OCI Big Data Service

    • Owned and implemented the Active Directory (AD) integration, enabling enterprise-grade authentication for all 12+ big data components, reducing enterprise onboarding time by 65%.

    • Standardized UID/GID registration and access control frameworks, ensuring 85% fewer identity conflicts in multi-tenant cloud environments.

    • Designed policy-driven user access enforcement, ensuring SOC 2, ISO 27001, and FedRAMP compliance for big data deployments across public and private OCI regions.

    • Developed role-based security automation, reducing manual security enforcement overhead by 40%, ensuring zero-trust compliance at scale.

    Keywords: Enterprise Identity & Access Management (IAM)  |  Big Data Security & Compliance  |  Role-Based Access Control (RBAC)  |  Zero-Trust Security Architecture
  • External Service Integration Framework for Customizable Cluster

    • Designed and implemented a dynamic cluster profile switching mechanism, allowing users to select and provision tailored big data environments, reducing cluster deployment time by 60%.

    • Developed a pluggable integration model, ensuring that external services (DCAT, Object Storage, etc.) dynamically adapt to different component combinations, providing future-proof extensibility as new profiles (e.g., Iceberg, Delta Lake) are introduced.

    • Architected a multi-region-aware deployment strategy, ensuring low-latency interoperability across OCI’s 50+ public regions and private cloud environments.

    • Ensured robust cross-component compatibility, allowing customers to seamlessly mix-and-match big data components without service disruption, enhancing cluster flexibility by 4×.

    Keywords: Customizable Big Data Cluster Profiles  |  Dynamic Service Orchestration  |  Multi-Region Deployment Automation  |  Future-Proofed External Integrations
Research Assistant
Data-Intensive Scalable Computing Laboratory (DISCL), Texas Tech University
Aug 2017 - May 2021
  • AI-driven Data Retention Framework (published in SC ‘21)

    • Led cross-institutional collaboration involving 2 national labs and an R1 university, resulting in -

    • Pioneered ActiveDR, an AI-driven data retention system designed for HPC environments, optimizing storage efficiency by prioritizing file retention for active users while purging inactive ones.

    • Evaluated on Titan supercomputer traces, it reduces file misses by up to 37% and retains 213% more data for active users while maintaining the same level of space utilization compared to traditional fixed-lifetime retention methods.

    Keywords: HPC Storage Management  |  Activeness-based Data Retention  |  User-Centric Storage Optimization
  • Metadata Indexing and High-Performance Data Search Engine for HPC Data Management

    • Designed and implemented MIQS - a Metadata Indexing and Querying Service for high-performance metadata search over self-describing data formats such as HDF5 and NetCDF. This work enables direct metadata indexing over data, eliminating the need to set up external databases. It achieves 172k× faster metadata searches than MongoDB-based solutions, 99% reduction in index construction time, and 75% lower memory footprint, making it an efficient and portable alternative for HPC scientific data management (published in SC ‘19).

    • Designed and implemented DART - a Distributed Adaptive Radix Tree for high-performance distributed affix-based keyword search, achieving 55× higher throughput than Distributed Hash Tables (DHTs) for prefix and suffix searches while ensuring balanced keyword distribution, reducing query contention, and improving scalability (published in PACT ‘18).

    Keywords: High-performance In-memory Index  |  In-memory Search Optimization  |  HPC Distributed Indexing and Search  |  Load-balanced Keyword Distribution  |  High-throughput Affix-based Search
  • Graph Partitioning Algorithms for Distributed Graph Databases

    • Designed and implemented AKIN - a similarity-based streaming graph partitioning algorithm that improves data locality and reduces edge-cut ratio in distributed graph storage systems. By leveraging vertex similarity, AKIN reduces edge-cut ratio by up to 20% compared to FENNEL, while maintaining balanced partitions with minimal overhead, making it a superior alternative to IOGP for real-time graph partitioning (published in CCGrid ‘18).

    • Initiated and shaped the core idea of IOGP - the first multi-stage, online graph partitioning algorithm designed for distributed graph databases, dynamically adjusting partitions as data evolves. By leveraging vertex connectivity and degree changes, IOGP improves query performance while maintaining balanced partitions with less than 10% overhead compared to FENNEL (published in HPDC ‘17).

    Keywords: Distributed Graph Databases  |  Streaming Graph Partitioning  |  Online Graph Partitioning  |  High-performance Graph OLTP Operation
  • Other Achievements

    • Secured NSF research funding by leading the successful development and submission of a competitive proposal, demonstrating expertise in grant writing and research leadership.

    • Mentored a Ph.D. student, guiding their research direction, publication strategy, and technical growth, contributing to the advancement of the research group.

    • Released two open-source software tools, enabling broader adoption of HPC research innovations:

      • ActiveDR - software release of SC '21 study (DOI: 10.5281/zenodo.5168853)
      • MIQS - software release of SC '19 study (DOI: 10.11578/dc.20210322.3)
    Keywords: Translational Research  |  Research Grant Acquisition  |  Leadership and Mentoring Skills
Research Assistant
STARLab, Texas Tech University
Jan 2016 - Dec 2016
  • Scalable Geospatial Data Mining & Visualization

    • Designed and deployed a distributed geospatial data mining infrastructure using Apache Spark, HBase, and HDFS, reducing overall large-scale geospatial analytics time from 12 months to 3 months.

    • Optimized big data compression and indexing strategies, reducing storage overhead by 40% and improving query latency by 60% for spatial datasets.

    • Developed a real-time geospatial visualization system, leveraging GDAL, Redis, and NodeJS, enabling high-resolution mapping of social media demographics across 10+ regions.

    • Conducted advanced geospatial data mining and demographic analysis on 500M+ geo-tagged Twitter records spanning 5 years, providing insights into geospatial demographic patterns and socioeconomic trends.

    Keywords: Distributed Big Data Infrastructure  |  Big Data Geospatial Analytics  |  Distributed Machine Learning and Data Mining  |  Apache Spark  |  HBase  |  HDFS  |  Geospatial Visualization  |  GDAL  |  Redis  |  NodeJS  |  Social Media Demographic Analysis  |  Remote Sensing & Socioeconomic Insights
Senior System R&D Engineer
Beijing Serious Technology Co., Ltd
Jan 2014 - Jan 2016
  • Scalable Data API & Messaging Architecture

    • Architected & developed Meshwork, a high-performance graph-based data access API supporting MySQL & Redis, increasing query efficiency by 5× in distributed systems.

    • Designed & implemented BrookSide, a low-latency AMQP-based messaging framework using RabbitMQ, improving inter-service communication by 40%.

    • Led the development of Webshot-rest-amqp-service, a Node.js-based snapshot automation tool, enabling real-time monitoring & web scraping at 10× faster capture rates.

    • Developed PCVF, a Parameter Constraining & Validation Framework for secure REST API governance, reducing API failures by 60% in mission-critical applications.

    • Integrated DevOps best practices with Maven, Jenkins, and JUnit, streamlining RESTful API deployment and improving CI/CD efficiency by 3×.

    Keywords: Graph-Based Data Access (MySQL, Redis)  |  High-Throughput Message Processing (RabbitMQ, AMQP)  |  REST API Governance & Validation  |  Cloud-Based Web Scraping  |  CI/CD for Distributed Systems
System R&D Engineer
Sina.com Technology (China) Co., Ltd (Weibo)
Jul 2010 - May 2013
  • Weibo Platform API & Scalable Data Services

    • Designed & Unified Weibo’s REST API, setting the foundation for its scalable platform (Patented: CN103049271B), improving API consistency & usability for 100M+ users.

    • Owned & led the development of T.cn, Weibo’s high-availability URL shortening service, processing millions of daily requests with sub-ms latency.

    • Owned & led Weibo’s User Data Service, a critical data infrastructure handling billions of user interactions, ensuring 99.99% uptime & horizontal scalability.

    • Optimized API query pipelines, reducing data retrieval latency by 50% and enhancing real-time content recommendations.

    Keywords: Platform-Level API Design (REST, OAuth)  |  High-Availability Web Services (Weibo, T.cn)  |  Mass-Scale Data Infrastructure (Big Data Processing)  |  URL Shortening & Tracking (Weibo Analytics)  |  Scalable Content Distribution
Earlier Experience
Beijing JustMusic & Datuu.com Technology
Feb 2009 - Jan 2013
  • Business Data Management & Operational Systems

    • Developed & optimized a business data management system, improving operational efficiency by 3×.

    • Designed & implemented a batch processing framework, optimizing data transformation pipelines for higher throughput.

    • Developed & maintained an operation management system, ensuring seamless feature integration & data consistency.

    • Implemented a business reporting module, enabling automated, real-time business insights.

    Keywords: Business Data Management  |  Batch Processing Optimization  |  Operations & Reporting Automation  |  Enterprise Software Development
Publications
Journal Papers
2020
N. Zhao, G. Cao, W. Zhang, E. Samson, and Y. Chen. Remote sensing and social sensing for socioeconomic systems: A comparison study between nighttime lights and location-based social media at the 500 m spatial resolution. International Journal of Applied Earth Observation and Geoinformation.
2019
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. Managing Rich Metadata in High-Performance Computing Systems Using a Graph Model. IEEE Transactions on Parallel and Distributed Systems.
2019
N. Zhao, W. Zhang, Y. Liu, E. Samson, Y. Chen, and G. Cao. Improving Nighttime Light Imagery With Location-Based Social Media Data. IEEE Transactions on Geoscience and Remote Sensing.
2018
N. Zhao, G. Cao, W. Zhang, and E. Samson. Tweets or nighttime lights: Comparison for preeminence in estimating socioeconomic factors. ISPRS Journal of Photogrammetry and Remote Sensing.
Conference Papers
2024
W. Zhang, H. Tang, and S. Byna. IDIOMS: Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, in the Proceedings of 2024 IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024).
2023
C. Niu, W. Zhang, S. Byna, and Y. Chen. PSQS: Parallel Semantic Querying Service for Self-describing File Formats, in the 2023 IEEE International Conference on Big Data (IEEE BigData 2023).
2022
C. Niu, W. Zhang, S. Byna, and Y. Chen. Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes, in the Proceedings of 2022 IEEE High Performance Extreme Computing Conference (HPEC '22). (acceptance rate: 30/120=25%)
2021
W. Zhang, S. Byna, H. Sim, S. Lee, S. Vazhkudai, and Y. Chen. Exploiting User Activeness for Data Retention in HPC Systems, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). (first-round acceptance rate: 86/365=23.6%, with another 13 papers asked for major revisions, per SC '21)
2019
W. Zhang, S. Byna, C. Niu, and Y. Chen. Exploring Metadata Search Essentials for Scientific Data Management, in the Proceedings of 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC '19). (acceptance rate: 23%)
2019
W. Zhang, S. Byna, H. Tang, B. Williams, and Y. Chen. MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). (first-round acceptance rate: 72/344=21%, with another 15 papers asked for major revisions, per SC '19)
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-Based Keyword Search on HPC Systems, in the Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT '18). (acceptance rate: 36/126=28.6%)
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, in the Proceedings of 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '18). (acceptance rate: 20.8%)
2017
D. Dai, W. Zhang, and Y. Chen. IOGP: An Incremental Online Graph Partitioning Algorithm for Distributed Graph Databases, in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). (acceptance rate: 19%)
2016
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. GraphMeta: A Graph-Based Engine for Managing Large-Scale HPC Rich Metadata, in the Proceedings of 2016 IEEE International Conference on Cluster Computing (CLUSTER '16). (acceptance rate: 39/162=24.1%)
Patents
2015
Wei Zhang. Method and Apparatus for Automatic Generation of API Interface, Patent No. CN103049271B (https://patents.google.com/patent/CN103049271B/en). Assigned to Beijing Weimeng Chuangke Network Technology Co., Ltd. Patent granted in China.
Software Releases
2020
W. Zhang, S. Byna, Y. Chen, National Science Foundation, and USDOE. MIQS v0.6 (link: https://www.osti.gov/biblio/1772233).
Thesis and Dissertation
2021
W. Zhang. Efficient scientific data discovery over self-describing file formats, Texas Tech University
Extended Abstracts
2020
S. Byna, Q. Koziol, H. Tang, W. Zhang, and Y. Chen. Searching metadata stored in self-describing file formats efficiently.
Posters
2020
W. Zhang. Activeness-based Data Retention Recommender for HPC Facilities, SC '20 ACM Graduate Student Research Competition Poster
2020
W. Zhang. Efficient Metadata Search for Scientific Data, SC '20 Doctoral Showcase Poster
2020
C. Niu, W. Zhang, S. Byna, and Y. Chen. Semantic Search for Self-Describing Scientific Data Formats, SC '20 Research Poster
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems, SC '18 Research Poster
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, CCGRID '18 Research Poster
2017
D. Dai, W. Zhang, and Y. Chen. POSTER: IOGP: An Incremental Online Graph Partitioning for Large-Scale Distributed Graph Databases, PPoPP '17 Research Poster
Grant Experience
Collaborative Research: SHF: Medium: Redefining Metadata Search for Scientific Dataset Discovery
2021
  • Independently led blueprint planning and idea shaping sessions, empowering team members to contribute their expertise and laying a solid foundation for successful project implementation.
  • Mentored and guided team members in meticulous research task decomposition and planning, fostering their skills in resource allocation and ensuring the timely execution of project milestones.
  • Provided coaching and support to team members in developing exceptional writing skills for major sections within funding proposals, enabling them to effectively communicate project objectives, methodologies, and anticipated outcomes.
  • Assumed a coaching role in revising logistic items, guiding team members through meticulous review and refinement of project details to enhance clarity, coherence, and overall proposal quality while fostering their professional growth.
Collaborative Research: SHF: Medium: Empowering Scientific Dataset Discovery through Self-contained, Semantic, and Linked Metadata Search
2020
  • Proficiently engaged in blueprint planning and idea shaping, laying a solid foundation for successful project implementation.
  • Conducted meticulous research task decomposition and planning, ensuring efficient allocation of resources and timely execution of project milestones.
  • Demonstrated exceptional writing skills in the development of major sections within funding proposals, effectively communicating project objectives, methodologies, and anticipated outcomes.
  • Assumed responsibility for revising logistic items, meticulously reviewing and refining project details to enhance clarity, coherence, and overall proposal quality.
Science FAIR - FAIR Data Management for Scientific AI Applications
2020
  • Applied adept idea shaping techniques to refine and structure partial content, ensuring coherence and alignment with project goals within the funding proposal.
  • Demonstrated excellent writing skills in effectively conveying background information and motivation behind the proposed project, highlighting its significance and potential impact.
  • Authored a compelling section focusing on an essential research task, providing a comprehensive overview of the task's objectives, methodology, and expected outcomes, thereby strengthening the overall proposal's technical merit.
Teaching Experience
Graduate Courses
Advanced Parallel Computing
Invited Lecturer | The Ohio State University
Parallel Computing
Invited Lecturer | Texas Tech University
Advanced Operating Systems
Course Project Designer and Mentor | Texas Tech University
Undergraduate Courses
Data Structures
Lab Instructor, Grader and Tutoring Session Host | Texas Tech University
Object Oriented Programming
Grader and Tutoring Session Host | Texas Tech University
Computer Architecture
Grader and Tutoring Session Host | Texas Tech University
Services
Panelist / Committee Member
2025
Committee Member
Program committee member for the 38th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25)
2024
Committee Member
Committee member of the 9th International Parallel Data Systems Workshop (PDSW '24, held in conjunction with SC '24)
2024
PC Member
Program committee member for the 36th International Conference on Scientific and Statistical Database Management (SSDBM '24)
2024
PC Member
Program committee member for the 37th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)
2023
PC Member
Program committee member for the 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024)
2023
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of Department of Energy (DOE) funding opportunity - Distributed Resilient Systems.
2022
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of Department of Energy (DOE) funding opportunity - Management and Storage of Scientific Data
2020
PC Member
Program committee member for the 27th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020)
Invited Paper Reviewer
5th IPDPS Workshop on Extreme-Scale Storage and Analysis (ESSA 2024)
IEEE Transactions on Parallel and Distributed Systems (TPDS)
The 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS '24)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '23)
The 52nd International Conference on Parallel Processing (ICPP '23)
IEEE International Parallel and Distributed Processing Symposium (IPDPS)
IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
International Conference on Parallel Processing (ICPP)
IEEE International Conference on Cluster Computing (Cluster)
IEEE International Conference on Cloud Computing (CLOUD)
IEEE International Conference on BigData (BigData)
International Conference on Utility and Cloud Computing (UCC)
IFIP International Conference on Network and Parallel Computing (NPC)
International Parallel Data Systems Workshop (PDSW)
International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2)
IEEE International Workshop on High-Performance Big Data and Cloud Computing (HPBDC)
IEEE Access
Presentations
Conference Presentations
August 2024
Distributed Affix-Based Metadata Search in Self-Describing Data Files, HDF5 User Group Meeting 2024, Chicago, IL, USA
May 2024
IDIOMS - Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, IEEE CCGrid 2024, Philadelphia, PA, USA
Feb 2024
IDIOMS - Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, LBNL Postdoc Symposium 2024, Berkeley, CA, USA
August 2023
Towards Self-contained Metadata Search Capability for Self-describing File Formats, HDF5 User Group Meeting 2023, Columbus, OH, USA
Nov 2021
Exploiting User Activeness for Data Retention in HPC Systems, the 33rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '21)
Nov 2020
Efficient Metadata Search for Scientific Data, the 32nd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’20)
Nov 2019
MIQS - Metadata Indexing and Querying Service for Self-describing Data Formats, the 31st ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’19)
Nov 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Nov 2018
Attributed Consistent Hashing for Heterogeneous Storage Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Jul 2018
I/O Characteristics Discovery in Cloud Storage Systems, 2018 IEEE International Conference on Cloud Computing (Cloud ’18)
May 2018
AKIN - A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’18)
Research Seminar Talks
Nov 2020
Concurrent Metadata Indexing for Scientific Data Harvesting
Aug 2020
Enabling High Throughput Concurrent In-Memory Metadata Indexing for Scientific Data Harvesting
May 2020
A Data Retention Recommender System for HPC Facilities
Mar 2020
A Recommender System for Promoting Scientific Research Collaboration and Data Sharing
Nov 2019
On Scientific Data Discoverability
Oct 2019
What Does a Bad Paper Look Like? Some Thoughts After A Paper Review
Jun 2019
MIQS - Metadata Indexing and Querying Service for Self-Describing File Formats
Feb 2019
Exploring Metadata Search Primitives for Scientific Data Management
Nov 2018
Metadata Indexing and Search for Self-contained Scientific Data Management Models
Sep 2018
Lightweight Metadata Search Service for Experimental and Observational Datasets
Apr 2018
From Index to Metadata
Feb 2018
Distributed Adaptive Radix Tree and Metadata Indexing
Dec 2017
Distributed Keyword Search for Metadata
Sep 2017
Towards Flexible and Efficient Metadata Search
May 2017
Data Management for Large-Scale Graph-Oriented Applications
Jan 2017
A Tutorial on CloudLab
Nov 2016
Geospatial Data Mining on Spark and HBase
Oct 2016
Similarity-Based Streaming Graph Partitioning for Distributed Graph Storage Systems
Jun 2016
Similarity-Based Graph Data Placement Strategy for Graph-based Applications on HPC
Mar 2016
An Online Graph Partitioner for Graph-Based Metadata Management System in High Performance Computing
Dec 2015
Data Partitioning on High-Performance Graph Computing System - Motivation, Exploration and Innovation
Video Presentations
Nov 2020
Activeness-based Data Retention Recommender for HPC Facilities (A 5-minute audio presentation for ACM Graduate Student Research Competition Poster at SC ’20)
Oct 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems (A 3-minute video teaser presentation for technical program presentation at PACT ’18)
Education
Texas Tech University
Aug 2014 - May 2021 | Lubbock, TX, USA
Ph.D. in Computer Science
Hebei University of Science and Technology
Sept 2003 - Jun 2007 | Shijiazhuang, Hebei, China
BSc in Computer Science
Skills
Programming Languages
Python, JavaScript, Java, C++, C#, Go, Rust, TypeScript
Server-side Development
Node.js, Django, ASP.NET Core, Spring Boot, Express.js, FastAPI
Databases
PostgreSQL, MySQL, MongoDB, Redis, Cassandra, Neo4j
Big Data Tools
Apache Spark, Hadoop, Apache Kafka, Apache Flink
Cloud Computing
AWS, Microsoft Azure, Google Cloud Platform, Docker, Kubernetes
Operating Systems
Linux, macOS, Windows
Software Engineering
Microservices Architecture, Design Patterns, DevOps, CI/CD
Web Development
React, Angular, Vue.js, HTML5, CSS3, SASS, WebAssembly
Languages
English
Professional working proficiency
Mandarin Chinese
Native