Wei Zhang
Computer Science Researcher at LBNL
zhangwei217245 [at] lbl [dot] gov
Work Experience
Postdoctoral Researcher
Lawrence Berkeley National Laboratory
Mar 2023 - Now
  • Led the IDIOMS project, addressing complex technical challenges in distributed metadata indexing to significantly enhance data management capabilities. The project supports the Laboratory's mission to advance scientific data management and fostered growth in distributed computing expertise and leadership.
  • Led the exploration and implementation of data I/O optimization techniques for Graph Neural Networks (GNN) in scientific applications. This research directly improves AI training efficiency by refining data access patterns and storage techniques, thereby contributing to the advancement of scalable AI solutions.
  • Spearheaded research initiatives and mentored Ph.D. students on the direction of AI for Scientific Data Discovery, which aims to revolutionize how researchers interact with and derive insights from complex data. This work involves domain-specific models, large language models (LLMs), and retrieval-augmented generation (RAG) techniques.
  • Developed BULKI, a novel data format designed to address limitations in traditional data serialization methods. BULKI's flexibility and efficiency make it an ideal candidate for supporting advanced data management and AI-driven applications, particularly in environments requiring rapid and adaptable data processing.
  • Led the evaluation and optimization of scientific data management techniques, including enhancing metadata indexing and querying within the Proactive Data Container (PDC) project and advancing multi-dimensional data stitching for lattice light-sheet microscopy. These efforts significantly improved data retrieval and processing capabilities in scientific applications.
Visiting Researcher
The Ohio State University
Mar 2023 - Now
  • Research collaboration with Prof. Suren Byna's group on scientific data management, focusing on data discovery and, in particular, AI-powered data discovery.
  • Mentoring junior Ph.D. students on the direction of scientific data management.
Adjunct Research Scientist
Texas Tech University
June 2021 - Now
  • Research collaboration with Prof. Yong Chen's group on scientific data management, particularly focusing on data discovery, provenance, and semantic scientific data discovery.
  • Mentoring junior Ph.D. students on the direction of semantic scientific data discovery, exploring the intersection of natural language processing, semantic query, LLM, RAG, and scientific data discovery.
Senior Member of Technical Staff
Oracle Corporation
July 2021 - Feb 2023
  • Directed critical initiatives within Oracle’s OCI Big Data Service, including the integration of the OCI Data Catalog Metastore and the optimization of Active Directory. These projects enhanced platform security, reliability, and scalability, and established best practices that drove operational excellence across the organization.
  • Spearheaded the development and implementation of key coordination and security services, including the UID/GID Coordinating Service and Security Management Module. These efforts significantly strengthened system efficiency and resilience, ensuring robust performance and adaptability in a high-demand environment.
  • Led the design and execution of the External Service Integration Framework in the Cluster Profile Project, focusing on metadata-driven access control. This initiative was pivotal in enhancing system flexibility and supporting the platform’s scalability to meet evolving business needs.
Research Assistant
Data-Intensive Scalable Computing Laboratory (DISCL), Texas Tech University
Aug 2017 - May 2021
  • Pioneered the development of advanced data management solutions in HPC systems, including a Distributed Adaptive Radix Tree for affix-based keyword search and a Metadata Indexing and Querying Service for self-describing data formats, significantly enhancing search and data retention capabilities. These studies were published at PACT '18, SC '19, and HiPC '19.
  • Spearheaded the development of innovative approaches for graph partitioning and storage resource management, including leading the creation of a Similarity-based Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, which was published at CCGrid '18 and HPDC '17. These contributions have been pivotal in optimizing storage and computational efficiency in distributed systems.
  • Led a collaborative effort integrating researchers from two national laboratories and an R1 university to develop an innovative data retention solution for data centers. This research was published at SC '21, highlighting the impact of analyzing user activeness in enhancing data management efficiency in high-performance computing systems.
  • Led the successful development and submission of an NSF funding proposal, demonstrating strong expertise in grant writing and research project management.
  • Mentored and guided a junior Ph.D. student, fostering their academic and research growth, while contributing to the advancement of the research group.
  • Two software releases:
    • ActiveDR - software release of SC '21 study (DOI: 10.5281/zenodo.5168853)
    • MIQS - software release of SC '19 study (DOI: 10.11578/dc.20210322.3)
Research Assistant
STARLab, Texas Tech University
Jan 2016 - Dec 2016
  • Led the development and deployment of a comprehensive data mining infrastructure leveraging cutting-edge technologies such as Apache Spark, HBase, and HDFS. This infrastructure was pivotal in setting up a distributed big data cluster specifically designed for efficient geospatial data mining.
  • Directed strategic optimizations across the big data software stack, with a focus on implementing unified data compression techniques. These optimizations significantly enhanced the performance and scalability of geospatial data processing workflows.
  • Pioneered scalable geospatial visualization solutions by deploying a robust visualization system for the distribution of social media users. This solution utilized GDAL in combination with NodeJS, Python, and Redis, providing geoscientists with powerful tools to analyze and interpret complex spatial data.
  • Executed advanced geospatial data mining and demographic analysis using technologies such as Apache Spark, HBase, and Hadoop. This work included extracting and analyzing demographic information from Twitter data, offering new insights into population dynamics and social behavior.
  • Initiated and led innovative research projects, including a comparative study titled "Remote Sensing and Social Sensing for Socioeconomic Systems," which examined the differences between Nighttime Lights and Location-based Social Media data. This research highlighted the potential of integrating social media data with traditional remote sensing for enhanced socioeconomic analysis.
Senior System R&D Engineer
Beijing Serious Technology Co., Ltd
Jan 2014 - Jan 2016
  • Architected and developed Meshwork, a sophisticated graph-like data access API supporting both MySQL and Redis for seamless and optimized data retrieval. This API significantly improved data access efficiency across distributed systems.
  • Designed and implemented BrookSide, a high-performance message processing framework built on AMQP protocols, specifically RabbitMQ, enabling efficient and reliable communication across microservices.
  • Led the development of the Webshot-rest-amqp-service project, a cutting-edge NodeJS application that automates the capture of website snapshots based on AMQP messages from RabbitMQ. This tool enhanced automated monitoring and web scraping capabilities.
  • Spearheaded the creation of PCVF, a robust Parameter Constraining and Validating Framework for RESTful Web Service APIs, developed as part of a confidential project. This framework ensured API reliability and security, aligning with stringent project requirements.
  • Guided the adoption of advanced DevOps practices, including the integration of Maven, Jenkins, Unit Testing, and a custom document generator. These practices were implemented to support RESTful Web Service APIs, ensuring seamless compatibility with the PCVF framework and enhancing the development workflow.
System R&D Engineer
Sina.com Technology (China) Co.,Ltd.
Jul 2010 - May 2013
  • Spearheaded the unification and optimization of the Weibo REST API, setting a strategic foundation for the platform's scalable growth by standardizing design, development, documentation, and testing processes. This initiative, which was ultimately patented as CN103049271B, enabled Weibo to efficiently evolve and meet the demands of its rapidly expanding user base.
  • Owned and led the development of T.cn, Weibo's URL shortening service, overseeing all critical aspects including design, implementation, and tracking infrastructure. This project was pivotal in enhancing content management and analytics across the platform.
  • Led the User Data Service team at Weibo's data platform, driving the development of new features, advancing technical capabilities, and ensuring the scalability and stability of the platform’s critical data services.
Senior Software Developer
Beijing JustMusic Co.,Ltd.
Feb 2009 - Jun 2010
  • Spearheaded the end-to-end development of a sophisticated business data management system, leveraging software engineering expertise to ensure efficient data handling, storage, and retrieval.
  • Designed and implemented a streamlined batch processing framework, enabling seamless execution of data processing tasks and optimizing system performance for enhanced productivity.
Software Developer
Beijing Datuu.com Technology Co.,Ltd.
Jan 2008 - Jan 2009
  • Developed an operation management system, taking charge of routine feature development, data maintenance, and ensuring seamless integration of essential functionalities.
  • Implemented a robust business reporting module within the operation management system, enabling accurate and timely generation of business reports to facilitate informed decision-making.
Publications
Journal Papers
2020
N. Zhao, G. Cao, W. Zhang, E. Samson, and Y. Chen. Remote sensing and social sensing for socioeconomic systems: A comparison study between nighttime lights and location-based social media at the 500 m spatial resolution. International Journal of Applied Earth Observation and Geoinformation.
2019
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. Managing Rich Metadata in High-Performance Computing Systems Using a Graph Model. IEEE Transactions on Parallel and Distributed Systems.
2019
N. Zhao, W. Zhang, Y. Liu, E. Samson, Y. Chen, and G. Cao. Improving Nighttime Light Imagery With Location-Based Social Media Data. IEEE Transactions on Geoscience and Remote Sensing.
2018
N. Zhao, G. Cao, W. Zhang, and E. Samson. Tweets or nighttime lights: Comparison for preeminence in estimating socioeconomic factors. ISPRS Journal of Photogrammetry and Remote Sensing.
Conference Papers
2024
H. Oh, W. Zhang, C. Rickett, S. Sukumar, and S. Byna. Evaluating Performance Trade-offs of Caching Strategies for AI-Powered Querying Systems, in the Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2024). (Acceptance Rate: 19.7%)
2024
W. Zhang, H. Tang, and S. Byna. IDIOMS: Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, in the Proceedings of the 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024).
2023
C. Niu, W. Zhang, S. Byna, and Y. Chen. PSQS: Parallel Semantic Querying Service for Self-describing File Formats, in the 2023 IEEE International Conference on Big Data (IEEE BigData 2023).
2022
C. Niu, W. Zhang, S. Byna, and Y. Chen. Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes, in the Proceedings of 2022 IEEE High Performance Extreme Computing Conference (HPEC '22). (acceptance rate: 30/120=25%)
2021
W. Zhang, S. Byna, H. Sim, S. Lee, S. Vazhkudai, and Y. Chen. Exploiting User Activeness for Data Retention in HPC Systems, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). (first-round acceptance rate: 86/365=23.6%; another 13 papers were asked for major revisions, per SC '21)
2019
W. Zhang, S. Byna, C. Niu, and Y. Chen. Exploring Metadata Search Essentials for Scientific Data Management, in the Proceedings of 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC '19). (acceptance rate: 23%)
2019
W. Zhang, S. Byna, H. Tang, B. Williams, and Y. Chen. MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). (first-round acceptance rate: 72/344=21%; another 15 papers were asked for major revisions, per SC '19)
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-Based Keyword Search on HPC Systems, in the Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT '18). (acceptance rate: 36/126=28.6%)
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, in the Proceedings of 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '18). (acceptance rate: 20.8%)
2017
D. Dai, W. Zhang, and Y. Chen. IOGP: An Incremental Online Graph Partitioning Algorithm for Distributed Graph Databases, in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). (acceptance rate: 19%)
2016
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. GraphMeta: A Graph-Based Engine for Managing Large-Scale HPC Rich Metadata, in the Proceedings of 2016 IEEE International Conference on Cluster Computing (CLUSTER '16). (acceptance rate: 39/162=24.1%)
Patents
2015
Wei Zhang. Method and Apparatus for Automatic Generation of API Interface, Patent No. CN103049271B (https://patents.google.com/patent/CN103049271B/en). Assigned to Beijing Weimeng Chuangke Network Technology Co., Ltd. Patent granted in China.
Software Releases
2020
W. Zhang, S. Byna, Y. Chen, National Science Foundation, and USDOE. MIQS v0.6 (link: https://www.osti.gov/biblio/1772233).
Thesis and Dissertation
2021
W. Zhang. Efficient scientific data discovery over self-describing file formats, Texas Tech University
Extended Abstracts
2020
S. Byna, Q. Koziol, H. Tang, W. Zhang, and Y. Chen. Searching metadata stored in self-describing file formats efficiently.
Posters
2020
W. Zhang. Activeness-based Data Retention Recommender for HPC Facilities, SC '20 ACM Graduate Student Research Competition Poster
2020
W. Zhang. Efficient Metadata Search for Scientific Data, SC '20 Doctoral Showcase Poster
2020
C. Niu, W. Zhang, S. Byna, and Y. Chen. Semantic Search for Self-Describing Scientific Data Formats, SC '20 Research Poster
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems, SC '18 Research Poster
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, CCGRID '18 Research Poster
2017
D. Dai, W. Zhang, and Y. Chen. POSTER: IOGP: An Incremental Online Graph Partitioning for Large-Scale Distributed Graph Databases, PPoPP '17 Research Poster
Presentations
Conference Presentations
August 2024
Distributed Affix-Based Metadata Search in Self-Describing Data Files, HDF5 User Group Meeting 2024, Chicago, IL, USA
May 2024
IDIOMS - Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, IEEE CCGrid 2024, Philadelphia, PA, USA
Feb 2024
IDIOMS - Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, LBNL Postdoc Symposium 2024, Berkeley, CA, USA
August 2023
Towards Self-contained Metadata Search Capability for Self-describing File Formats, HDF5 User Group Meeting 2023, Columbus, OH, USA
Nov 2021
Exploiting User Activeness for Data Retention in HPC Systems, the 33rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '21)
Nov 2020
Efficient Metadata Search for Scientific Data, the 32nd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’20)
Nov 2019
MIQS - Metadata Indexing and Querying Service for Self-describing Data Formats, the 31st ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’19)
Nov 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Nov 2018
Attributed Consistent Hashing for Heterogeneous Storage Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Jul 2018
I/O Characteristics Discovery in Cloud Storage Systems, 2018 IEEE International Conference on Cloud Computing (Cloud ’18)
May 2018
AKIN - A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’18)
Research Seminar Talks
Nov 2020
Concurrent Metadata Indexing for Scientific Data Harvesting
Aug 2020
Enabling High Throughput Concurrent In-Memory Metadata Indexing for Scientific Data Harvesting
May 2020
A Data Retention Recommender System for HPC Facilities
Mar 2020
A Recommender System for Promoting Scientific Research Collaboration and Data Sharing
Nov 2019
On Scientific Data Discoverability
Oct 2019
What Does a Bad Paper Look Like? Some Thoughts After A Paper Review
Jun 2019
MIQS - Metadata Indexing and Querying Service for Self-Describing File Formats
Feb 2019
Exploring Metadata Search Primitives for Scientific Data Management
Nov 2018
Metadata Indexing and Search for Self-contained Scientific Data Management Models
Sep 2018
Lightweight Metadata Search Service for Experimental and Observational Datasets
Apr 2018
From Index to Metadata
Feb 2018
Distributed Adaptive Radix Tree and Metadata Indexing
Dec 2017
Distributed Keyword Search for Metadata
Sep 2017
Towards Flexible and Efficient Metadata Search
May 2017
Data Management for Large-Scale Graph-Oriented Applications
Jan 2017
A Tutorial on CloudLa
Nov 2016
Geospatial Data Mining on Spark and HBase
Oct 2016
Similarity-Based Streaming Graph Partitioning for Distributed Graph Storage Systems
Jun 2016
Similarity-Based Graph Data Placement Strategy for Graph-based Applications on HPC
Mar 2016
An Online Graph Partitioner for Graph-Based Metadata Management System in High Performance Computing
Dec 2015
Data Partitioning on High-Performance Graph Computing System - Motivation, Exploration and Innovation
Video Presentations
Nov 2020
Activeness-based Data Retention Recommender for HPC Facilities (A 5-minute audio presentation for ACM Graduate Student Research Competition Poster at SC ’20)
Oct 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems (A 3-minute video teaser presentation for technical program presentation at PACT ’18)
Services
Panelist / Committee Member
2024
Committee Member
Committee member of the 9th International Parallel Data Systems Workshop (PDSW '24, held in conjunction with SC24)
2024
PC Member
Program committee member for the 36th International Conference on Scientific and Statistical Database Management (SSDBM '24)
2024
PC Member
Program committee member for the 37th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)
2023
PC Member
Program committee member for the 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024)
2023
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of the Department of Energy (DOE) funding opportunity - Distributed Resilient Systems
2022
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of the Department of Energy (DOE) funding opportunity - Management and Storage of Scientific Data
2020
PC Member
Program committee member for the 27th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020)
Invited Paper Reviewer
-
IEEE Transactions on Parallel and Distributed Systems (TPDS)
-
The 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS '24)
-
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '23)
-
The 52nd International Conference on Parallel Processing (ICPP '23)
-
IEEE International Parallel and Distributed Processing Symposium (IPDPS)
-
IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
-
International Conference on Parallel Processing (ICPP)
-
IEEE International Conference on Cluster Computing (Cluster)
-
IEEE International Conference on Cloud Computing (CLOUD)
-
IEEE International Conference on Big Data (BigData)
-
International Conference on Utility and Cloud Computing (UCC)
-
IFIP International Conference on Network and Parallel Computing (NPC)
-
International Parallel Data Systems Workshop (PDSW)
-
International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2)
-
IEEE International Workshop on High-Performance Big Data and Cloud Computing (HPBDC)
-
IEEE Open Access
Teaching
Experience
Graduate Courses
Advanced Parallel Computing
Invited Lecturer | Ohio State University
Parallel Computing
Invited Lecturer | Texas Tech University
Advanced Operating Systems
Course Project Designer and Mentor | Texas Tech University
Undergraduate Courses
Data Structures
Lab Instructor, Grader and Tutoring Session Host | Texas Tech University
Object Oriented Programming
Grader and Tutoring Session Host | Texas Tech University
Computer Architecture
Grader and Tutoring Session Host | Texas Tech University
Education
Texas Tech University
Aug 2014 - May 2021 | Lubbock, TX, USA
Ph.D. in Computer Science
Hebei University of Science and Technology
Sept 2003 - Jun 2007 | Shijiazhuang, Hebei, China
BSc in Computer Science
Skills
Programming Languages
C, Java, Python, Scala, NodeJS, Bash, Rust, C++
Server-side Development
Spring, RPC, AMQP, RESTful web services
Databases
MySQL, Oracle, SQL Server, NoSQL - Memcache, Redis, HBase, MongoDB, Neo4J
BigData Tools
Spark, Hadoop, MPI
Cloud Computing
AWS experience, Docker
Operating Systems
Linux, Unix, Windows
Software Engineering
Design Patterns, UML, Continuous Integration
Web Development
HTML, JavaScript, CSS, XML, XSLT, AJAX
Languages
English
Professional working proficiency
Mandarin Chinese
Native