Scalable I/O Optimization for Large-scale Parallel GNN Training
Led the architectural design of a Rust-based NDArray data store 💾, improving I/O throughput by 31–135× in PyTorch-based graph neural network (GNN) ensemble training workflows.
Engineered an I/O-aware PyTorch DataLoader, significantly reducing contention and cutting I/O wait time by 99% under high-concurrency training scenarios.
Delivered a high-performance, production-ready system that bridges AI model scalability and I/O performance across modern HPC clusters.
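The contention-reduction idea above can be sketched in a few lines. This is an illustrative toy, not the Rust store or the production DataLoader: it assumes a hypothetical `MmapNdarrayStore` that packs samples into one memory-mapped file, so concurrent training workers share the OS page cache instead of issuing many small per-sample reads.

```python
# Hypothetical sketch: serve training batches from one memory-mapped file
# instead of per-sample file reads, so concurrent workers avoid contending
# on small random I/O. Names here are illustrative, not the real system.
import os
import tempfile
import numpy as np

class MmapNdarrayStore:
    """Minimal stand-in for an NDArray data store backed by one large file."""
    def __init__(self, path, shape, dtype=np.float32):
        self.path, self.shape, self.dtype = path, shape, dtype

    @classmethod
    def create(cls, path, array):
        mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
        mm[:] = array          # one sequential write instead of many small ones
        mm.flush()
        return cls(path, array.shape, array.dtype)

    def batch(self, indices):
        # Readers share the OS page cache; no per-item open()/read() calls.
        mm = np.memmap(self.path, dtype=self.dtype, mode="r", shape=self.shape)
        return np.asarray(mm[indices])

# Usage: write 1000 feature vectors once, then fetch a batch by index,
# as a DataLoader's __getitem__/collate path might.
path = os.path.join(tempfile.mkdtemp(), "features.bin")
data = np.arange(1000 * 4, dtype=np.float32).reshape(1000, 4)
store = MmapNdarrayStore.create(path, data)
batch = store.batch([0, 5, 999])
```

A real I/O-aware loader would layer prefetching and worker-affinity on top, but the core win is the same: amortize file opens and let the page cache absorb concurrency.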
Exascale End-to-End Object-Centric Data Management
Spearheaded the evolution of the team’s software release management standard, unifying the developer experience, adopting DevOps best practices, and improving the software release review process. These efforts gave the team better control over the pace of software development and research execution.
Architected IDIOMS 💾, a high-performance, trie-based distributed metadata search system, achieving 407× faster independent queries and 300× faster collective queries than SQLite (published in CCGrid 2024).
Designed a schema-less binary format for data serialization, achieving up to 10% space reduction compared to MessagePack with a better API for self-guided data parsing (presented at SC’24).
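The trie-based search idea behind IDIOMS can be illustrated with a minimal sketch. This is not the IDIOMS implementation (which is distributed and far more involved); it only shows how a prefix trie maps metadata keywords to object IDs without an external database, with all names being hypothetical:

```python
# Illustrative sketch (not IDIOMS itself): a prefix trie mapping metadata
# keywords to object IDs, supporting exact and prefix queries in-memory.
class MetadataTrie:
    def __init__(self):
        self.root = {}  # char -> child node; "$" -> set of object IDs

    def insert(self, keyword, obj_id):
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(obj_id)

    def query_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return set()
            node = node[ch]
        # Collect IDs from the whole subtree under the prefix.
        out, stack = set(), [node]
        while stack:
            n = stack.pop()
            out |= n.get("$", set())
            stack.extend(c for k, c in n.items() if k != "$")
        return out

idx = MetadataTrie()
idx.insert("temperature", "obj1")
idx.insert("temp_max", "obj2")
idx.insert("pressure", "obj3")
# idx.query_prefix("temp") now resolves to both matching objects.
```

Sharding such a trie across servers (so each handles a keyspace partition) is what turns this single-node structure into a distributed search service.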
AI-powered Data Discovery for Gray Graph Engine (Collaborative Research with Hewlett Packard Enterprise (HPE))
Led a cross-institutional team across LBNL, OSU, and HPE to optimize AI inference workflows in HPE’s Cray Graph Engine (CGE), mentoring the Ph.D. student in Dr. Suren Byna’s group and achieving 63% faster query latency for scientific datasets through -
Dynamic UDF Filtering - Optimized the CV inference workflow in both animal taxonomy and facial recognition use cases, enabling real-time feature extraction over 90k+ OpenImages images;
Three-Stage Caching Framework - Designed object/target/feature caching strategies, reducing redundant AI computations by 82% across 128-node clusters on Perlmutter;
Feature caching reduced AI inference query time from 74.1s → 0.42s (176× speedup);
Data ingestion overhead limited to less than 5%;
Paper published at IEEE BigData 2024 (acceptance rate - 19.7%)
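The feature-caching stage above can be sketched with stdlib memoization. This is a deliberately simplified, hypothetical stand-in (names like `extract_features` are illustrative, not the CGE API), showing why repeated queries skip the expensive inference call entirely:

```python
# Hypothetical sketch of the feature-caching idea: memoize per-object
# inference results so repeated queries never re-run the model.
# The real framework layers object/target/feature caches; this shows
# only the innermost (feature) stage.
from functools import lru_cache

calls = {"n": 0}  # counts how often the "model" actually runs

@lru_cache(maxsize=4096)  # feature cache: object ID -> feature vector
def extract_features(obj_id):
    calls["n"] += 1       # stands in for a costly CV inference pass
    return tuple(float(ord(c)) for c in obj_id)

# First query pays the inference cost; the repeats are cache hits.
for _ in range(3):
    feats = extract_features("img_001")
```

In a distributed setting the cache key/value store would be shared across nodes, but the accounting is the same: redundant computations collapse to lookups.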
Semantic Query over Scientific Datasets
Mentored the Ph.D. student in Dr. Yong Chen’s group on exploring the intersection of natural language processing, semantic query, LLMs, RAG, and scientific data discovery.
Kv2vec - an LSTM-based vector embedding for key-value pairs in scientific metadata, reducing semantic search errors from 17.3% to 3.1% (an 82% reduction) vs. traditional methods (published in IEEE HPEC’22).
PSQS - a parallel semantic metadata querying service over self-describing data formats, achieving a 20% improvement in query hit rate and 15% higher recall (published in IEEE BigData 2023).
ICEAGE - an LLM-powered metadata search engine that outperforms traditional keyword-based and code-generation methods, achieving 98% query accuracy and 5.43× higher throughput in CPU-based environments and 29.52× in GPU-accelerated settings for scientific data retrieval (under review).
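The common thread of these systems is embedding-and-rank semantic search. The sketch below is a toy stand-in: a hashing "embedding" replaces the learned LSTM/LLM embeddings of Kv2vec/ICEAGE, and the corpus entries are invented, but the ranking-by-cosine-similarity mechanics are the same:

```python
# Illustrative sketch of embedding-based semantic metadata search:
# represent key-value metadata as vectors and rank by cosine similarity.
# A toy hashing embedding stands in for the learned models; the corpus
# entries below are hypothetical.
import numpy as np

def embed(text, dim=256):
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0   # bag-of-words hashing "embedding"
    return vec / (np.linalg.norm(vec) or 1.0)

corpus = {
    "run42": "variable temperature units kelvin",
    "run43": "variable pressure units pascal",
}
vecs = {k: embed(v) for k, v in corpus.items()}

def search(query):
    q = embed(query)
    # Rank datasets by cosine similarity (vectors are unit-normalized).
    return max(vecs, key=lambda k: float(q @ vecs[k]))
```

Swapping `embed` for a trained model is what moves this from keyword overlap to true semantic matching ("heat" finding "temperature"), which is where the reported error reductions come from.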
Metastore Integration with OCI Data Catalog
Owned and led the integration of OCI Data Catalog (DCAT) into OCI Big Data Service, ensuring seamless metadata federation across 12+ components (HDFS, Spark, Hive, etc.).
Designed and implemented automated DCAT enablement & disablement, ensuring zero-touch configuration for customers deploying big data clusters on OCI cloud.
Engineered a distributed metadata orchestration framework, reducing sync latency by 67% and enhancing data lineage tracking across 50+ OCI regions.
Security & Identity Management for OCI Big Data Service
Owned and implemented the Active Directory (AD) integration, enabling enterprise-grade authentication for all 12+ big data components, reducing enterprise onboarding time by 65%.
Standardized UID/GID registration and access control frameworks, ensuring 85% fewer identity conflicts in multi-tenant cloud environments.
External Service Integration Framework for Customizable Cluster
Designed and implemented a dynamic cluster profile switching mechanism, allowing users to select and provision tailored big data environments, reducing cluster deployment time by 60%.
Developed a pluggable integration model, ensuring that external services (DCAT, Object Storage, etc.) dynamically adapt to different component combinations, providing future-proof extensibility as new profiles (e.g., Iceberg, Delta Lake) are introduced.
Architected a multi-region-aware deployment strategy, ensuring low-latency interoperability across OCI’s 50+ public regions and private cloud environments.
AI-driven Data Retention Framework (published in SC ‘21)
Led a cross-institutional collaboration involving 2 national labs and an R1 university, resulting in -
Pioneered ActiveDR, an AI-driven data retention system designed for HPC environments, optimizing storage efficiency by prioritizing file retention for active users while purging inactive ones.
Evaluated on Titan supercomputer traces, ActiveDR reduced file misses by up to 37% and retained 213% more data for active users at the same level of space utilization as traditional fixed-lifetime retention methods.
Metadata Indexing and High-Performance Data Search Engine for HPC Data Management
Designed and implemented MIQS - a Metadata Indexing and Querying Service for high-performance metadata search over self-describing data formats such as HDF5 and NetCDF. MIQS enables direct metadata indexing over the data itself, eliminating the need to set up external databases. It achieves 172k× faster metadata searches than MongoDB-based solutions, a 99% reduction in index construction time, and a 75% lower memory footprint, making it an efficient and portable alternative for HPC scientific data management (published in SC ‘19).
Designed and implemented DART - a Distributed Adaptive Radix Tree for high-performance distributed affix-based keyword search, achieving 55× higher throughput than distributed hash tables (DHTs) for prefix and suffix searches while ensuring balanced keyword distribution, reducing query contention and improving scalability (published in PACT ‘18).
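The affix-search idea behind DART can be shown with a minimal single-node sketch: index each keyword twice, forward for prefix queries and reversed for suffix queries, so both become ordered-range scans. This is only the affix trick under simplified assumptions; the real DART distributes an adaptive radix tree across servers.

```python
# Illustrative sketch of affix-based keyword search (not DART itself):
# store keywords forward and reversed, so prefix AND suffix queries
# both reduce to a sorted-range scan.
import bisect

class AffixIndex:
    def __init__(self, keywords):
        self.fwd = sorted(keywords)
        self.rev = sorted(w[::-1] for w in keywords)

    @staticmethod
    def _range(arr, prefix):
        lo = bisect.bisect_left(arr, prefix)
        hi = bisect.bisect_left(arr, prefix + "\uffff")  # end of prefix range
        return arr[lo:hi]

    def prefix(self, p):
        return self._range(self.fwd, p)

    def suffix(self, s):
        # A suffix query is a prefix query over the reversed keywords.
        return [w[::-1] for w in self._range(self.rev, s[::-1])]

idx = AffixIndex(["energy", "entropy", "density", "velocity"])
```

Hash-based partitioning destroys this key ordering (hence DHTs' poor affix performance); keeping radix/ordered structure while still balancing load is exactly the problem DART solves.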
Graph Partitioning Algorithms for Distributed Graph Databases
Designed and implemented AKIN - a similarity-based streaming graph partitioning algorithm that improves data locality and reduces the edge-cut ratio in distributed graph storage systems. By leveraging vertex similarity, AKIN reduces the edge-cut ratio by up to 20% compared to FENNEL while maintaining balanced partitions with minimal overhead, making it a superior alternative to IOGP for real-time graph partitioning (published in CCGrid ‘18).
Initiated and shaped the core idea of IOGP - the first multi-stage, online graph partitioning algorithm designed for distributed graph databases, dynamically adjusting partitions as data evolves. By leveraging vertex connectivity and degree changes, IOGP improves query performance by 2× while maintaining balanced partitions with less than 10% overhead compared to FENNEL (published in HPDC ‘17).
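The streaming-partitioning family these algorithms belong to can be sketched in a few lines. This toy is AKIN-flavored but heavily simplified (real AKIN uses a richer vertex-similarity measure; the scoring below is a generic FENNEL-style locality-minus-balance heuristic, and the example graph is invented):

```python
# Toy sketch of similarity-driven streaming partitioning (simplified,
# not the published AKIN/IOGP algorithms): each arriving vertex goes to
# the partition holding the most of its neighbors, discounted by a
# balance penalty so partitions stay even.
def stream_partition(edges_per_vertex, k=2, gamma=0.5):
    parts = [set() for _ in range(k)]
    assign = {}
    for v, nbrs in edges_per_vertex.items():
        def score(i):
            locality = len(parts[i] & set(nbrs))      # neighbors already there
            return locality - gamma * len(parts[i])    # balance penalty
        best = max(range(k), key=score)
        parts[best].add(v)
        assign[v] = best
    return assign

graph = {  # two triangles joined by the single edge c-d
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
assign = stream_partition(graph)
# Each triangle lands in its own partition; only edge c-d is cut.
```

The edge-cut ratio this heuristic minimizes is exactly the cross-partition traffic a distributed graph database pays on traversal queries, which is why the 20% edge-cut reduction translates into query-time wins.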
Other Achievements
Secured NSF research funding by leading the successful development and submission of a competitive proposal, demonstrating expertise in grant writing and research leadership.
Mentored a Ph.D. student, guiding their research direction, publication strategy, and technical growth, contributing to the advancement of the research group.
Released two open-source software tools, enabling broader adoption of HPC research innovations.
Scalable Geospatial Data Mining & Visualization
Designed and deployed a distributed geospatial data mining infrastructure using Apache Spark, HBase, and HDFS, reducing overall large-scale geospatial analytics time from 12 months to 3 months.
Optimized big data compression and indexing strategies, reducing storage overhead by 40% and improving query latency by 60% for spatial datasets.
Developed a real-time geospatial visualization system, leveraging GDAL, Redis, and NodeJS, enabling high-resolution mapping of social media demographics across 10+ regions.
Conducted advanced geospatial data mining and demographic analysis on 500M+ geo-tagged Twitter records across 5 years, providing insights into geospatial demographic patterns and socioeconomic trends.
Scalable Data API & Messaging Architecture
Architected & developed Meshwork, a high-performance graph-based data access API supporting MySQL & Redis, increasing query efficiency by 5× in distributed systems.
Designed & implemented BrookSide, a low-latency AMQP-based messaging framework using RabbitMQ, improving inter-service communication by 40%.
Led the development of Webshot-rest-amqp-service, a Node.js-based snapshot automation tool, enabling real-time monitoring & web scraping at 10× faster capture rates.
Developed PCVF, a Parameter Constraining & Validation Framework for secure REST API governance, reducing API failures by 60% in mission-critical applications.
Integrated DevOps best practices with Maven, Jenkins, and JUnit, streamlining RESTful API deployment and improving CI/CD efficiency by 3×.
Weibo Platform API & Scalable Data Services
Designed & unified Weibo’s REST API, setting the foundation for its scalable platform (Patented: CN103049271B), improving API consistency & usability for 500M+ users.
Owned & led the development of T.cn, Weibo’s high-availability URL shortening service, processing millions of daily requests with sub-ms latency.
Owned & led Weibo’s User Data Service, a critical data infrastructure handling billions of user interactions, ensuring 99.9% uptime & horizontal scalability.
Optimized API query pipelines, reducing data retrieval latency by 20% and enhancing real-time content recommendations.
Business Data Management & Operational Systems
Developed & optimized a business data management system, improving operational efficiency by 3×.
Designed & implemented a batch processing framework, optimizing data transformation pipelines for higher throughput.
Developed & maintained an operation management system, ensuring seamless feature integration & data consistency.
Implemented a business reporting module, enabling automated, real-time business insights.