Wei Zhang
Computer Science Researcher at LBNL
zhangwei217245 [at] gmail [dot] com
Work Experience
Computer Science Researcher
Lawrence Berkeley National Laboratory
Mar 2023 - Now
  • Scalable I/O Optimization for Large-scale Parallel GNN Training

    • Led the architectural design of a Rust-based NDArray data store 💾, improving I/O throughput by 31~135× in PyTorch-based graph neural network (GNN) ensemble training workflows.

    • Engineered an I/O-aware PyTorch DataLoader, significantly reducing contention and cutting I/O wait time by 99% under high-concurrency training scenarios.

    • Delivered a high-performance, production-ready system that bridges AI model scalability and I/O performance across modern HPC clusters.

    Keywords: Software Architect  |  HPC AI Convergence  |  Rust High-perf Storage  |  PyTorch Ecosystem Integration
  • Exascale End-to-End Object-Centric Data Management

    • Spearheaded the evolution of the team’s software release management standard, including developer experience unification, DevOps best practices, and improvements to the software release review process. These efforts gave the team better control over the pace of software development and research execution.

    • Architected a high-performance, trie-based distributed metadata search system - IDIOMS 💾, achieving 407× faster independent queries and 300× faster collective queries than SQLite, published in CCGrid 2024.

    • Designed a schema-less binary format for data serialization - up to 10% space reduction compared to MessagePack, with a better API for self-guided data parsing, presented at SC’24.

    Keywords: Technical Leadership  |  CI/CD & DevOps in HPC  |  Developer Experience  |  Distributed Data Discovery
Visiting Researcher
The Ohio State University
Mar 2023 - Now
  • AI-powered Data Discovery for Cray Graph Engine (Collaborative Research with Hewlett Packard Enterprise (HPE))

    • Led a cross-institutional team across LBNL, OSU and HPE to optimize AI inference workflows in HPE’s Cray Graph Engine (CGE), mentoring a Ph.D. student in Dr. Suren Byna’s group and achieving 63% lower query latency for scientific datasets through -

    • Dynamic UDF Filtering - Optimized CV inference workflow in both animal taxonomy and facial recognition use cases, enabling real-time feature extraction for 90k+ OpenImages datasets;

    • Three-Stage Caching Framework - Designed object/target/feature caching strategies, reducing redundant AI computations by 82% across 128-node clusters on Perlmutter;

    • Feature caching reduced AI inference query time from 74.1s → 0.42s (176× speedup);

    • Data ingestion overhead limited to less than 5%;

    • Paper published at IEEE BigData 2024 (acceptance rate - 19.7%)

    Keywords: Technical Leadership and Mentorship  |  AI/ML Inference Reusability  |  Computer Vision  |  PyTorch Integration
Adjunct Research Scientist
Texas Tech University
June 2021 - Now
  • Semantic Query over Scientific Datasets

    • Mentored a Ph.D. student in Dr. Yong Chen’s group on exploring the intersection of natural language processing, semantic query, LLM, RAG, and scientific data discovery.

    • Kv2vec - LSTM-based vector embedding for key-value pairs in scientific metadata, reducing semantic search errors from 17.3% → 3.1% (by 80%) vs. traditional methods (published in IEEE HPEC’22).

    • PSQS - a parallel semantic metadata querying service over self-describing data formats, achieving 20% improvement in query hit rate and 15% higher recall (published in IEEE BigData 2023).

    • ICEAGE - an LLM-powered metadata search engine that outperforms traditional keyword-based and code-generation methods, achieving 98% query accuracy and 5.43× higher throughput in CPU-based environments and 29.52× in GPU-accelerated settings for scientific data retrieval (published in SSDBM 2025).

    Keywords: Technical Leadership and Mentorship  |  NLP and Embeddings  |  Parallel Semantic Search  |  LLM and RAG
Senior Member of Technical Staff
Oracle Corporation
July 2021 - Feb 2023
  • Metastore Integration with OCI Data Catalog

    • Owned and led the integration of OCI Data Catalog (DCAT) into OCI Big Data Service, ensuring seamless metadata federation across 12+ components (HDFS, Spark, Hive, etc.).

    • Designed and implemented automated DCAT enablement & disablement, ensuring zero-touch configuration for customers deploying big data clusters on OCI cloud.

    • Engineered a distributed metadata orchestration framework, reducing sync latency by 67% and enhancing data lineage tracking across 50+ OCI regions.

    Keywords: Automated Metadata Federation  |  Cloud Data Lake Governance  |  Multi-Region Service Orchestration
  • Security & Identity Management for OCI Big Data Service

    • Owned and implemented the Active Directory (AD) integration, enabling enterprise-grade authentication for all 12+ big data components, reducing enterprise onboarding time by 65%.

    • Standardized UID/GID registration and access control frameworks, ensuring 85% fewer identity conflicts in multi-tenant cloud environments.

    Keywords: Enterprise Identity & Access Management (IAM)  |  Big Data Security & Compliance  |  Role-Based Access Control (RBAC)
  • External Service Integration Framework for Customizable Cluster

    • Designed and implemented a dynamic cluster profile switching mechanism, allowing users to select and provision tailored big data environments, reducing cluster deployment time by 60%.

    • Developed a pluggable integration model, ensuring that external services (DCAT, Object Storage, etc.) dynamically adapt to different component combinations, providing future-proof extensibility as new profiles (e.g., Iceberg, Delta Lake) are introduced.

    • Architected a multi-region-aware deployment strategy, ensuring low-latency interoperability across OCI’s 50+ public regions and private cloud environments.

    Keywords: Customizable Big Data Cluster Profiles  |  Dynamic Service Orchestration  |  Multi-Region Deployment Automation  |  Future-Proofed External Integrations
Research Assistant
Data-Intensive Scalable Computing Laboratory (DISCL), Texas Tech University
Aug 2017 - May 2021
  • AI-driven Data Retention Framework (published in SC ‘21)

    • Led cross-institutional collaboration involving 2 national labs and an R1 university, resulting in -

    • Pioneered ActiveDR, an AI-driven data retention system designed for HPC environments, optimizing storage efficiency by prioritizing file retention for active users while purging inactive ones.

    • Evaluated on Titan supercomputer traces, ActiveDR reduces file misses by up to 37% and retains 213% more data for active users while maintaining the same level of space utilization compared to traditional fixed-lifetime retention methods.

    Keywords: HPC Storage Management  |  Activeness-based Data Retention  |  User-Centric Storage Optimization
  • Metadata Indexing and High-Performance Data Search Engine for HPC Data Management

    • Designed and implemented MIQS - a Metadata Indexing and Querying Service for high-performance metadata search over self-describing data formats such as HDF5 and NetCDF. This work enables direct metadata indexing over data, eliminating the need to set up external databases. It achieves 172k× faster metadata searches than MongoDB-based solutions, a 99% reduction in index construction time, and a 75% lower memory footprint, making it an efficient and portable alternative for HPC scientific data management (published in SC '19).

    • Designed and implemented DART - a Distributed Adaptive Radix Tree for high-performance distributed affix-based keyword search, achieving 55× higher throughput than Distributed Hash Tables (DHTs) for prefix and suffix searches while ensuring balanced keyword distribution, reducing query contention, and improving scalability (published in PACT '18).

    Keywords: High-performance In-memory Index  |  In-memory Search Optimization  |  HPC Distributed Indexing and Search  |  Load-balanced Keyword Distribution  |  High-throughput Affix-based Search
  • Graph Partitioning Algorithm for Distributed Graph Databases

    • Designed and implemented AKIN - a similarity-based streaming graph partitioning algorithm that improves data locality and reduces edge-cut ratio in distributed graph storage systems. By leveraging vertex similarity, AKIN reduces edge-cut ratio by up to 20% compared to FENNEL, while maintaining balanced partitions with minimal overhead, making it a superior alternative to IOGP for real-time graph partitioning (published in CCGrid '18).

    • Initiated and shaped the core idea of IOGP - the first multi-stage, online graph partitioning algorithm designed for distributed graph databases, dynamically adjusting partitions as data evolves. By leveraging vertex connectivity and degree changes, IOGP improves query performance by 2× while maintaining balanced partitions with less than 10% overhead compared to FENNEL (published in HPDC '17).

    Keywords: Distributed Graph Databases  |  Streaming Graph Partitioning  |  Online Graph Partitioning  |  High-performance Graph OLTP Operation
  • Other Achievements

    • Secured NSF research funding by leading the successful development and submission of a competitive proposal, demonstrating expertise in grant writing and research leadership.

    • Mentored a Ph.D. student, guiding their research direction, publication strategy, and technical growth, contributing to the advancement of the research group.

    • Released two open-source software tools, enabling broader adoption of HPC research innovations:

      • ActiveDR - software release of SC '21 study (DOI: 10.5281/zenodo.5168853)
      • MIQS - software release of SC '19 study (DOI: 10.11578/dc.20210322.3)
    Keywords: Translational Research  |  Research Grant Acquisition  |  Leadership and Mentoring Skills
Research Assistant
STARLab, Texas Tech University
Jan 2016 - Dec 2016
  • Scalable Geospatial Data Mining & Visualization

    • Designed and deployed a distributed geospatial data mining infrastructure using Apache Spark, HBase, and HDFS, reducing overall large-scale geospatial analytics time from 12 months to 3 months.

    • Optimized big data compression and indexing strategies, reducing storage overhead by 40% and improving query latency by 60% for spatial datasets.

    • Developed a real-time geospatial visualization system, leveraging GDAL, Redis, and NodeJS, enabling high-resolution mapping of social media demographics across 10+ regions.

    • Conducted advanced geospatial data mining and demographic analysis on 500M+ geo-tagged Twitter records across 5 years, providing insights into geospatial demographic patterns and socioeconomic trends.

    Keywords: Distributed Big Data Infrastructure  |  Big Data Geospatial Analytics  |  Distributed Machine Learning and Data Mining  |  Apache Spark  |  HBase  |  HDFS  |  Geospatial Visualization  |  GDAL  |  Redis  |  NodeJS  |  Social Media Demographic Analysis  |  Remote Sensing & Socioeconomic Insights
Senior System R&D Engineer
Beijing Serious Technology Co., Ltd
Jan 2014 - Jan 2016
  • Scalable Data API & Messaging Architecture

    • Architected & developed Meshwork, a high-performance graph-based data access API supporting MySQL & Redis, increasing query efficiency by 5× in distributed systems.

    • Designed & implemented BrookSide, a low-latency AMQP-based messaging framework using RabbitMQ, improving inter-service communication by 40%.

    • Led the development of Webshot-rest-amqp-service, a Node.js-based snapshot automation tool, enabling real-time monitoring & web scraping at 10× faster capture rates.

    • Developed PCVF, a Parameter Constraining & Validation Framework for secure REST API governance, reducing API failures by 60% in mission-critical applications.

    • Integrated DevOps best practices with Maven, Jenkins, and JUnit, streamlining RESTful API deployment and improving CI/CD efficiency by 3×.

    Keywords: Graph-Based Data Access (MySQL, Redis)  |  High-Throughput Message Processing (RabbitMQ, AMQP)  |  REST API Governance & Validation  |  Cloud-Based Web Scraping  |  CI/CD for Distributed Systems
System R&D Engineer
Sina.com Technology (China) Co., Ltd (Weibo)
Jul 2010 - May 2013
  • Weibo Platform API & Scalable Data Services

    • Designed & unified Weibo’s REST API, setting the foundation for its scalable platform (Patented: CN103049271B), improving API consistency & usability for 500M+ users.

    • Owned & led the development of T.cn, Weibo’s high-availability URL shortening service, processing millions of daily requests with sub-ms latency.

    • Owned & led Weibo’s User Data Service, a critical data infrastructure handling billions of user interactions, ensuring 99.9% uptime & horizontal scalability.

    • Optimized API query pipelines, reducing data retrieval latency by 20% and enhancing real-time content recommendations.

    Keywords: Platform-Level API Design (REST, OAuth)  |  High-Availability Web Services (Weibo, T.cn)  |  Mass-Scale Data Infrastructure (Big Data Processing)  |  URL Shortening & Tracking (Weibo Analytics)  |  Scalable Content Distribution
Earlier Experience
Beijing JustMusic & Datuu.com Technology
Feb 2009 - Jan 2013
  • Business Data Management & Operational Systems

    • Developed & optimized a business data management system, improving operational efficiency by 3×.

    • Designed & implemented a batch processing framework, optimizing data transformation pipelines for higher throughput.

    • Developed & maintained an operation management system, ensuring seamless feature integration & data consistency.

    • Implemented a business reporting module, enabling automated, real-time business insights.

    Keywords: Business Data Management  |  Batch Processing Optimization  |  Operations & Reporting Automation  |  Enterprise Software Development
Publications
Papers
2025
W. Zhang, K. Ibrahim, and S. Byna. Optimizing Distributed Object Storage I/O for Large-scale Parallel GNN Training on Atomistic Graphs (under review)
2025
S. Saha, H. Tang, W. Zhang, and S. Byna. Distributed Metadata Querying on HPC Systems (under review)
2025
C. Niu, W. Zhang, Y. Zhao, and Y. Chen. Energy Efficient or Exhaustive? Benchmarking Power Consumption of LLM Inference Engines, in the HotCarbon Workshop on Sustainable Computer Systems 2025 (HotCarbon '25). (accepted)
2025
C. Niu, W. Zhang, M. Side, and Y. Chen. ICEAGE: Intelligent Contextual Exploration and Answer Generation Engine for Scientific Data Discovery, in the Proceedings of the 37th International Conference on Scalable Scientific Data Management (SSDBM 2025).
2024
H. Oh, W. Zhang, C. Rickett, S. Sukumar, and S. Byna. Evaluating Performance Trade-offs of Caching Strategies for AI-Powered Querying Systems, in the Proceedings of the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). (acceptance rate: 19.7%)
2024
W. Zhang, H. Tang, and S. Byna. IDIOMS: Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, in the Proceedings of the 2024 IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024).
2023
C. Niu, W. Zhang, S. Byna, and Y. Chen. PSQS: Parallel Semantic Querying Service for Self-describing File Formats, in the 2023 IEEE International Conference on Big Data (IEEE BigData 2023).
2022
C. Niu, W. Zhang, S. Byna, and Y. Chen. Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes, in the Proceedings of the 2022 IEEE High Performance Extreme Computing Conference (HPEC '22). (acceptance rate: 30/120=25%)
2021
W. Zhang, S. Byna, H. Sim, S. Lee, S. Vazhkudai, and Y. Chen. Exploiting User Activeness for Data Retention in HPC Systems, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). (first-round acceptance rate: 86/365=23.6%, with another 13 papers asked for major revisions, per SC '21)
2020
N. Zhao, G. Cao, W. Zhang, E. Samson, and Y. Chen. Remote sensing and social sensing for socioeconomic systems: A comparison study between nighttime lights and location-based social media at the 500 m spatial resolution. International Journal of Applied Earth Observation and Geoinformation.
2019
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. Managing Rich Metadata in High-Performance Computing Systems Using a Graph Model. IEEE Transactions on Parallel and Distributed Systems.
2019
N. Zhao, W. Zhang, Y. Liu, E. Samson, Y. Chen, and G. Cao. Improving Nighttime Light Imagery With Location-Based Social Media Data. IEEE Transactions on Geoscience and Remote Sensing.
2019
W. Zhang, S. Byna, C. Niu, and Y. Chen. Exploring Metadata Search Essentials for Scientific Data Management, in the Proceedings of the 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC '19). (acceptance rate: 23%)
2019
W. Zhang, S. Byna, H. Tang, B. Williams, and Y. Chen. MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). (first-round acceptance rate: 72/344=21%, with another 15 papers asked for major revisions, per SC '19)
2018
N. Zhao, G. Cao, W. Zhang, and E. Samson. Tweets or nighttime lights: Comparison for preeminence in estimating socioeconomic factors. ISPRS Journal of Photogrammetry and Remote Sensing.
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-Based Keyword Search on HPC Systems, in the Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT '18). (acceptance rate: 36/126=28.6%)
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, in the Proceedings of the 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '18). (acceptance rate: 20.8%)
2017
D. Dai, W. Zhang, and Y. Chen. IOGP: An Incremental Online Graph Partitioning Algorithm for Distributed Graph Databases, in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). (acceptance rate: 19%)
2016
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. GraphMeta: A Graph-Based Engine for Managing Large-Scale HPC Rich Metadata, in the Proceedings of the 2016 IEEE International Conference on Cluster Computing (CLUSTER '16). (acceptance rate: 39/162=24.1%)
Patents
2015
Wei Zhang. Method and Apparatus for Automatic Generation of API Interface, Patent No. CN103049271B (https://patents.google.com/patent/CN103049271B/en). Assigned to Beijing Weimeng Chuangke Network Technology Co., Ltd. Patent granted in China.
Software Releases
2020
W. Zhang, S. Byna, Y. Chen, National Science Foundation, and USDOE. MIQS v0.6 (link: https://www.osti.gov/biblio/1772233).
Thesis and Dissertation
2021
W. Zhang. Efficient scientific data discovery over self-describing file formats, Texas Tech University
Extended Abstracts
2020
S. Byna, Q. Koziol, H. Tang, W. Zhang, and Y. Chen. Searching metadata stored in self-describing file formats efficiently.
Posters
2020
W. Zhang. Activeness-based Data Retention Recommender for HPC Facilities, SC '20 ACM Graduate Student Research Competition Poster
2020
W. Zhang. Efficient Metadata Search for Scientific Data, SC '20 Doctoral Showcase Poster
2020
C. Niu, W. Zhang, S. Byna, and Y. Chen. Semantic Search for Self-Describing Scientific Data Formats, SC '20 Research Poster
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems, SC '18 Research Poster
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, CCGRID '18 Research Poster
2017
D. Dai, W. Zhang, and Y. Chen. POSTER: IOGP: An Incremental Online Graph Partitioning for Large-Scale Distributed Graph Databases, PPoPP '17 Research Poster
Grant Experience
Collaborative Research: SHF: Medium: Redefining Metadata Search for Scientific Dataset Discovery
2021
  • Independently led blueprint planning and idea shaping sessions, empowering team members to contribute their expertise and laying a solid foundation for successful project implementation.
  • Mentored and guided team members in meticulous research task decomposition and planning, fostering their skills in resource allocation and ensuring the timely execution of project milestones.
  • Provided coaching and support to team members in developing exceptional writing skills for major sections within funding proposals, enabling them to effectively communicate project objectives, methodologies, and anticipated outcomes.
  • Assumed a coaching role in revising logistic items, guiding team members through meticulous review and refinement of project details to enhance clarity, coherence, and overall proposal quality while fostering their professional growth.
Collaborative Research: SHF: Medium: Empowering Scientific Dataset Discovery through Self-contained, Semantic, and Linked Metadata Search
2020
  • Proficiently engaged in blueprint planning and idea shaping, laying a solid foundation for successful project implementation.
  • Conducted meticulous research task decomposition and planning, ensuring efficient allocation of resources and timely execution of project milestones.
  • Demonstrated exceptional writing skills in the development of major sections within funding proposals, effectively communicating project objectives, methodologies, and anticipated outcomes.
  • Assumed responsibility for revising logistic items, meticulously reviewing and refining project details to enhance clarity, coherence, and overall proposal quality.
Science FAIR - FAIR Data Management for Scientific AI Applications
2020
  • Applied adept idea shaping techniques to refine and structure partial content, ensuring coherence and alignment with project goals within the funding proposal.
  • Demonstrated excellent writing skills in effectively conveying background information and motivation behind the proposed project, highlighting its significance and potential impact.
  • Authored a compelling section focusing on an essential research task, providing a comprehensive overview of the task's objectives, methodology, and expected outcomes, thereby strengthening the overall proposal's technical merit.
Teaching Experience
Graduate Courses
Advanced Parallel Computing
Invited Lecturer | The Ohio State University
Parallel Computing
Invited Lecturer | Texas Tech University
Advanced Operating Systems
Course Project Designer and Mentor | Texas Tech University
Undergraduate Courses
Data Structures
Lab Instructor, Grader and Tutoring Session Host | Texas Tech University
Object Oriented Programming
Grader and Tutoring Session Host | Texas Tech University
Computer Architecture
Grader and Tutoring Session Host | Texas Tech University
Services
Panelist / Committee Member
2025
Program Committee Member
Program committee member for the 10th International Parallel Data Systems Workshop (PDSW '25, held in conjunction with SC25)
2025
Program Committee Member
Program committee member for the 38th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25)
2024
Program Committee Member
Program committee member for the 9th International Parallel Data Systems Workshop (PDSW '24, held in conjunction with SC24)
2024
Program Committee Member
Program committee member for the 36th International Conference on Scientific and Statistical Database Management (SSDBM '24)
2024
Program Committee Member
Program committee member for the 37th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)
2023
Program Committee Member
Program committee member for the 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024)
2023
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of Department of Energy (DOE) funding opportunity - Distributed Resilient Systems.
2022
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of Department of Energy (DOE) funding opportunity - Management and Storage of Scientific Data
2020
Program Committee Member
Program committee member for the 27th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020)
Invited Paper Reviewer
5th IPDPS Workshop on Extreme-Scale Storage and Analysis (ESSA 2024)
IEEE Transactions on Parallel and Distributed Systems (TPDS)
The 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS '24)
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '23)
The 52nd International Conference on Parallel Processing (ICPP '23)
IEEE International Parallel and Distributed Processing Symposium (IPDPS)
IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
International Conference on Parallel Processing (ICPP)
IEEE International Conference on Cluster Computing (Cluster)
IEEE International Conference on Cloud Computing (CLOUD)
IEEE International Conference on BigData (BigData)
International Conference on Utility and Cloud Computing (UCC)
IFIP International Conference on Network and Parallel Computing (NPC)
International Parallel Data Systems Workshop (PDSW)
International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2)
IEEE International Workshop on High-Performance Big Data and Cloud Computing (HPBDC)
IEEE Open Access
Presentations
Invited Talks
May 2025
Advancing Scientific Data Discovery and Management: From Human to AI-Centric Data Cognition, NSDF All-hands Meeting 2025, San Diego, CA, USA
August 2024
Distributed Affix-Based Metadata Search in Self-Describing Data Files, HDF5 User Group Meeting 2024, Chicago, IL, USA
August 2023
Towards Self-contained Metadata Search Capability for Self-describing File Formats, HDF5 User Group Meeting 2023, Columbus, OH, USA
Conference Presentations
Feb 2025
BULKI - Binary Unified Layout for Key-Value Interchange, LBNL Postdoc Symposium 2025, Berkeley, CA, USA
Nov 2024
BULKI - Binary Unified Layout for Key-Value Interchange, PDSW WIP Session (co-located with SC'24), Atlanta, GA, USA
May 2024
IDIOMS - Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, IEEE CCGrid 2024, Philadelphia, PA, USA
Feb 2024
IDIOMS - Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, LBNL Postdoc Symposium 2024, Berkeley, CA, USA
Nov 2021
Exploiting User Activeness for Data Retention in HPC Systems, the 33rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '21)
Nov 2020
Efficient Metadata Search for Scientific Data, the 32nd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’20)
Nov 2019
MIQS - Metadata Indexing and Querying Service for Self-describing Data Formats, the 31st ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’19)
Nov 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Nov 2018
Attributed Consistent Hashing for Heterogeneous Storage Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Jul 2018
I/O Characteristics Discovery in Cloud Storage Systems, 2018 IEEE International Conference on Cloud Computing (Cloud ’18)
May 2018
AKIN - A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’18)
Research Seminar Talks
Nov 2020
Concurrent Metadata Indexing for Scientific Data Harvesting
Aug 2020
Enabling High Throughput Concurrent In-Memory Metadata Indexing for Scientific Data Harvesting
May 2020
A Data Retention Recommender System for HPC Facilities
Mar 2020
A Recommender System for Promoting Scientific Research Collaboration and Data Sharing
Nov 2019
On Scientific Data Discoverability
Oct 2019
What Does a Bad Paper Look Like? Some Thoughts After A Paper Review
Jun 2019
MIQS - Metadata Indexing and Querying Service for Self-Describing File Formats
Feb 2019
Exploring Metadata Search Primitives for Scientific Data Management
Nov 2018
Metadata Indexing and Search for Self-contained Scientific Data Management Models
Sep 2018
Lightweight Metadata Search Service for Experimental and Observational Datasets
Apr 2018
From Index to Metadata
Feb 2018
Distributed Adaptive Radix Tree and Metadata Indexing
Dec 2017
Distributed Keyword Search for Metadata
Sep 2017
Towards Flexible and Efficient Metadata Search
May 2017
Data Management for Large-Scale Graph-Oriented Applications
Jan 2017
A Tutorial on CloudLab
Nov 2016
Geospatial Data Mining on Spark and HBase
Oct 2016
Similarity-Based Streaming Graph Partitioning for Distributed Graph Storage Systems
Jun 2016
Similarity-Based Graph Data Placement Strategy for Graph-based Applications on HPC
Mar 2016
An Online Graph Partitioner for Graph-Based Metadata Management System in High Performance Computing
Dec 2015
Data Partitioning on High-Performance Graph Computing System - Motivation, Exploration and Innovation
Video Presentations
Nov 2020
Activeness-based Data Retention Recommender for HPC Facilities (A 5-minute audio presentation for ACM Graduate Student Research Competition Poster at SC ’20)
Oct 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems (A 3-minute video teaser presentation for technical program presentation at PACT ’18)
Education
Texas Tech University
Aug 2014 - May 2021 | Lubbock, TX, USA
Ph.D. in Computer Science
Hebei University of Science and Technology
Sept 2003 - Jun 2007 | Shijiazhuang, Hebei, China
BSc in Computer Science
Skills
Programming Languages
Python, JavaScript, Java, C++, C#, Go, Rust, TypeScript
Server-side Development
Node.js, Django, ASP.NET Core, Spring Boot, Express.js, FastAPI
Databases
PostgreSQL, MySQL, MongoDB, Redis, Cassandra, Neo4j
Big Data Tools
Apache Spark, Hadoop, Apache Kafka, Apache Flink
Cloud Computing
AWS, Microsoft Azure, Google Cloud Platform, Docker, Kubernetes
Operating Systems
Linux, macOS, Windows
Software Engineering
Microservices Architecture, Design Patterns, DevOps, CI/CD
Web Development
React, Angular, Vue.js, HTML5, CSS3, SASS, WebAssembly
Languages
English
Professional working proficiency
Mandarin Chinese
Native