Wei Zhang
Computer Science Researcher at LBNL
zhangwei217245 [at] lbl [dot] gov
Work Experience
Computer Science Researcher
Lawrence Berkeley National Laboratory
Mar 2023 - Present
  • Index-powered Distributed Object-centric Metadata Search (IDIOMS) - Led the IDIOMS project, resolving technical challenges in distributed metadata indexing and improving metadata search efficiency. Fostered interdisciplinary collaboration and advanced the data analysis and management strategies of the parent project, in support of the Laboratory's mission in scientific data management.
  • Evaluation and Enhancement of Metadata Indexing and Querying in Proactive Data Container - Led comprehensive testing and benchmarking of metadata retrieval and querying capabilities within the Proactive Data Container project, identifying and then addressing performance gaps to optimize both operations.
  • Advancement in LLSM Multi-Dimensional Data Stitching - Led the design and execution of lattice light-sheet microscopy data processing applications, leveraging the Proactive Data Container infrastructure.
Senior Member of Technical Staff
Oracle Corporation
July 2021 - Feb 2023
  • Led the design and implementation of OCI Data Catalog Metastore Integration with OCI Big Data Service, overseeing critical project aspects such as component orchestration, service integration, security, and test automation.
  • Orchestrated the design and optimization of the Active Directory Integration project for the AuthN/AuthZ modules in OCI Big Data Service, including collecting and analyzing use cases, planning roadmaps, and developing proofs of concept for best practices.
  • Led the design and implementation of the UID/GID coordinating service for the Cluster, ensuring efficient and reliable coordination processes.
  • Spearheaded the design and implementation of the security management module within the Generic Kerberos and Active Directory Configuration Framework for OCI Big Data Cluster, strengthening the cluster management system.
  • Directed the design and implementation of the external service integration framework in the Cluster Profile project of OCI Big Data Service, focusing on metadata-driven module access control across different cluster profiles.
Research Assistant
Data-Intensive Scalable Computing Laboratory (DISCL), Texas Tech University
Aug 2017 - May 2021
  • Developed and implemented an innovative solution for exploiting user activeness to optimize data retention in HPC Systems.
  • Designed and deployed a Metadata Indexing and Querying Service to efficiently handle self-describing data formats.
  • Created a Distributed Adaptive Radix Tree for affix-based keyword search, enhancing search capabilities in distributed systems.
  • Led the development of a successful NSF funding proposal, showcasing expertise in grant writing and research project management.
  • Mentored and provided guidance to a junior Ph.D. student, offering support and fostering their academic and research growth.
  • Presented a Similarity-based Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems at the CCGrid '18 conference.
  • Conducted research and presented findings on Metadata Search Essentials for Scientific Data Management at the HiPC '19 conference.
  • Two software releases:
    • ActiveDR - software release of SC '21 study (DOI: 10.5281/zenodo.5168853)
    • MIQS - software release of SC '19 study (DOI: 10.11578/dc.20210322.3)
Research Assistant
STARLab, Texas Tech University
Jan 2016 - Dec 2016
  • Developed a comprehensive data mining infrastructure leveraging Spark, HBase, and HDFS.
  • Carried out optimizations, with a particular focus on unified data compression across the full big data software stack.
  • Deployed geo-spatial visualization of the distribution of social media users, employing GDAL in conjunction with NodeJS, Python, and Redis.
  • Conducted a thorough sentiment analysis on a dataset spanning five years of Twitter activity related to presidential election results.
  • Performed advanced demographic information extraction through geo-spatial analysis of Twitter data, using Apache Spark, HBase, and Hadoop.
  • Initiated a comparative study titled "Remote Sensing and Social Sensing for Socioeconomic Systems", examining the differences between nighttime lights and location-based social media at a spatial resolution of 500 meters.
  • Undertook a project to improve nighttime light imagery by incorporating location-based social media data.
  • Produced a comparative analysis titled "Tweets or Nighttime Lights: Comparison for Preeminence in Estimating Socioeconomic Factors".
Senior System R&D Engineer
Beijing Serious Technology Co., Ltd
Jan 2014 - Jan 2016
  • Designed and built Meshwork, a graph-like data access API, supporting both MySQL and Redis, for seamless and optimized data retrieval.
  • Designed and built BrookSide, a message processing framework for AMQP, specifically RabbitMQ, to enable efficient and reliable communication.
  • Led the Webshot-rest-amqp-service project, a NodeJS application responsible for capturing website snapshots based on messages received from AMQP implementations like RabbitMQ.
  • Led the development of PCVF, a Parameter Constraining and Validating Framework for RESTful Web Service APIs, as part of a confidential project.
  • Guided DevOps practices involving Maven, Jenkins, Unit Testing, and a customized document generator to support RESTful Web Service APIs, ensuring compatibility with the PCVF framework.
System R&D Engineer
Sina.com Technology (China) Co.,Ltd.
Jul 2010 - May 2013
  • Optimized Weibo REST API for enhanced user experience, leading to the development of a BDD Testing Tool, specification for Weibo Open API documentation, and specification for Weibo Open API implementation.
  • Implemented T.cn, a URL shortening service, along with a program to track URL hits.
  • Managed the user data service for Weibo Open API, a critical data access path, ensuring high performance, availability, and adaptability to changing functionality.
  • Led the migration of data and services for User Service v2.0 within Weibo Open API, including the development of a distributed data service and message processing system.
  • Improved cache service performance for User Service v2.0 by conducting thorough analysis and reducing Memcache resource usage.
  • Designed and implemented a visualized service monitoring system to track the running status of the user service, including cache hit ratio, MySQL throughput, and critical user-related services such as Relationship Service and Feed Service.
Senior Software Developer
Beijing JustMusic Co.,Ltd.
Feb 2009 - Jun 2010
  • Spearheaded the end-to-end development of a sophisticated business data management system, leveraging software engineering expertise to ensure efficient data handling, storage, and retrieval.
  • Designed and implemented a streamlined batch processing framework, enabling seamless execution of data processing tasks and optimizing system performance for enhanced productivity.
Software Developer
Beijing Datuu.com Technology Co.,Ltd.
Jan 2008 - Jan 2009
  • Developed an operation management system, taking charge of routine feature development, data maintenance, and ensuring seamless integration of essential functionalities.
  • Implemented a robust business reporting module within the operation management system, enabling accurate and timely generation of business reports to facilitate informed decision-making.
Publications
Journal Papers
2020
N. Zhao, G. Cao, W. Zhang, E. Samson, and Y. Chen. Remote sensing and social sensing for socioeconomic systems: A comparison study between nighttime lights and location-based social media at the 500 m spatial resolution. International Journal of Applied Earth Observation and Geoinformation.
2019
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. Managing Rich Metadata in High-Performance Computing Systems Using a Graph Model. IEEE Transactions on Parallel and Distributed Systems.
2019
N. Zhao, W. Zhang, Y. Liu, E. Samson, Y. Chen, and G. Cao. Improving Nighttime Light Imagery With Location-Based Social Media Data. IEEE Transactions on Geoscience and Remote Sensing.
2018
N. Zhao, G. Cao, W. Zhang, and E. Samson. Tweets or nighttime lights: Comparison for preeminence in estimating socioeconomic factors. ISPRS Journal of Photogrammetry and Remote Sensing.
Conference Papers
2024
W. Zhang, H. Tang, and S. Byna. IDIOMS: Index-powered Distributed Object-centric Metadata Search for Scientific Data Management, in the Proceedings of the 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024). (Accepted and in the process of publishing)
2023
C. Niu, W. Zhang, S. Byna, and Y. Chen. PSQS: Parallel Semantic Querying Service for Self-describing File Formats, in the 2023 IEEE International Conference on Big Data (BigData) (IEEE BigData 2023).
2022
C. Niu, W. Zhang, S. Byna, and Y. Chen. Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes, in the Proceedings of 2022 IEEE High Performance Extreme Computing Conference (HPEC '22). (acceptance rate: 30/120=25%)
2021
W. Zhang, S. Byna, H. Sim, S. Lee, S. Vazhkudai, and Y. Chen. Exploiting User Activeness for Data Retention in HPC Systems, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). (first-round acceptance rate: 86/365=23.6%, with another 13 papers asked for major revisions, per SC '21)
2019
W. Zhang, S. Byna, C. Niu, and Y. Chen. Exploring Metadata Search Essentials for Scientific Data Management, in the Proceedings of 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC '19). (acceptance rate: 23%)
2019
W. Zhang, S. Byna, H. Tang, B. Williams, and Y. Chen. MIQS: Metadata Indexing and Querying Service for Self-Describing File Formats, in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). (first-round acceptance rate: 72/344=21%, with another 15 papers asked for major revisions, per SC '19)
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. DART: Distributed Adaptive Radix Tree for Efficient Affix-Based Keyword Search on HPC Systems, in the Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT '18). (acceptance rate: 36/126=28.6%)
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, in the Proceedings of 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '18). (acceptance rate: 20.8%)
2017
D. Dai, W. Zhang, and Y. Chen. IOGP: An Incremental Online Graph Partitioning Algorithm for Distributed Graph Databases, in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). (acceptance rate: 19%)
2016
D. Dai, Y. Chen, P. Carns, J. Jenkins, W. Zhang, and R. Ross. GraphMeta: A Graph-Based Engine for Managing Large-Scale HPC Rich Metadata, in the Proceedings of 2016 IEEE International Conference on Cluster Computing (CLUSTER '16). (acceptance rate: 39/162=24.1%)
Software Releases
2020
W. Zhang, S. Byna, Y. Chen, National Science Foundation, and USDOE. MIQS v0.6 (link: https://www.osti.gov/biblio/1772233).
Thesis and Dissertation
2021
W. Zhang. Efficient scientific data discovery over self-describing file formats, Texas Tech University
Extended Abstracts
2020
S. Byna, Q. Koziol, H. Tang, W. Zhang, and Y. Chen. Searching metadata stored in self-describing file formats efficiently.
Posters
2020
W. Zhang. Activeness-based Data Retention Recommender for HPC Facilities, SC '20 ACM Graduate Student Research Competition Poster
2020
W. Zhang. Efficient Metadata Search for Scientific Data, SC '20 Doctoral Showcase Poster
2020
C. Niu, W. Zhang, S. Byna, and Y. Chen. Semantic Search for Self-Describing Scientific Data Formats, SC '20 Research Poster
2018
W. Zhang, H. Tang, S. Byna, and Y. Chen. Distributed Adaptive Radix Tree for Efficient Metadata Search on HPC Systems, SC '18 Research Poster
2018
W. Zhang, Y. Chen, and D. Dai. AKIN: A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, CCGRID '18 Research Poster
2017
D. Dai, W. Zhang, and Y. Chen. POSTER: IOGP: An Incremental Online Graph Partitioning for Large-Scale Distributed Graph Databases, PPoPP '17 Research Poster
Presentations
Conference Presentations
Nov 2021
Exploiting User Activeness for Data Retention in HPC Systems, the 33rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '21)
Nov 2020
Efficient Metadata Search for Scientific Data, the 32nd ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’20)
Nov 2019
MIQS - Metadata Indexing and Querying Service for Self-describing Data Formats, the 31st ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’19)
Nov 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Nov 2018
Attributed Consistent Hashing for Heterogeneous Storage Systems, the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18)
Jul 2018
I/O Characteristics Discovery in Cloud Storage Systems, 2018 IEEE International Conference on Cloud Computing (Cloud ’18)
May 2018
AKIN - A Streaming Graph Partitioning Algorithm for Distributed Graph Storage Systems, 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’18)
Research Seminar Talks
Nov 2020
Concurrent Metadata Indexing for Scientific Data Harvesting
Aug 2020
Enabling High Throughput Concurrent In-Memory Metadata Indexing for Scientific Data Harvesting
May 2020
A Data Retention Recommender System for HPC Facilities
Mar 2020
A Recommender System for Promoting Scientific Research Collaboration and Data Sharing
Nov 2019
On Scientific Data Discoverability
Oct 2019
What Does a Bad Paper Look Like? Some Thoughts After A Paper Review
Jun 2019
MIQS - Metadata Indexing and Querying Service for Self-Describing File Formats
Feb 2019
Exploring Metadata Search Primitives for Scientific Data Management
Nov 2018
Metadata Indexing and Search for Self-contained Scientific Data Management Models
Sep 2018
Lightweight Metadata Search Service for Experimental and Observational Datasets
Apr 2018
From Index to Metadata
Feb 2018
Distributed Adaptive Radix Tree and Metadata Indexing
Dec 2017
Distributed Keyword Search for Metadata
Sep 2017
Towards Flexible and Efficient Metadata Search
May 2017
Data Management for Large-Scale Graph-Oriented Applications
Jan 2017
A Tutorial on CloudLa
Nov 2016
Geospatial Data Mining on Spark and HBase
Oct 2016
Similarity-Based Streaming Graph Partitioning for Distributed Graph Storage Systems
Jun 2016
Similarity-Based Graph Data Placement Strategy for Graph-based Applications on HPC
Mar 2016
An Online Graph Partitioner for Graph-Based Metadata Management System in High Performance Computing
Dec 2015
Data Partitioning on High-Performance Graph Computing System - Motivation, Exploration and Innovation
Video Presentations
Nov 2020
Activeness-based Data Retention Recommender for HPC Facilities (A 5-minute audio presentation for ACM Graduate Student Research Competition Poster at SC ’20)
Oct 2018
DART - Distributed Adaptive Radix Tree for Efficient Affix-based Keyword Search on HPC Systems (A 3-minute video teaser presentation for technical program presentation at PACT ’18)
Services
Panelist / Committee Member
2023
PC Member
Program committee member for the 37th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)
2023
PC Member
Program committee member for the 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2024)
2023
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of Department of Energy (DOE) funding opportunity - Distributed Resilient Systems.
2022
DOE ASCR Panelist
Panelist for the Advanced Scientific Computing Research (ASCR) program of Department of Energy (DOE) funding opportunity - Management and Storage of Scientific Data
2020
PC Member
Program committee member for the 27th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020)
Invited Paper Reviewer
-
IEEE Transactions on Parallel and Distributed Systems (TPDS)
-
The 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS '24)
-
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '23)
-
The 52nd International Conference on Parallel Processing (ICPP '23)
-
IEEE International Parallel and Distributed Processing Symposium (IPDPS)
-
IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
-
International Conference on Parallel Processing (ICPP)
-
IEEE International Conference on Cluster Computing (Cluster)
-
IEEE International Conference on Cloud Computing (CLOUD)
-
IEEE International Conference on BigData (BigData)
-
International Conference on Utility and Cloud Computing (UCC)
-
IFIP International Conference on Network and Parallel Computing (NPC)
-
International Parallel Data Systems Workshop (PDSW)
-
International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2)
-
IEEE International Workshop on High-Performance Big Data and Cloud Computing (HPBDC)
-
IEEE Open Access
Teaching
Experience
Graduate Courses
Parallel Computing
Invited Lecturer | Ohio State University
Parallel Computing
Invited Lecturer | Texas Tech University
Advanced Operating Systems
Course Project Designer and Mentor | Texas Tech University
Undergraduate Courses
Data Structures
Lab Instructor, Grader and Tutoring Session Host | Texas Tech University
Object Oriented Programming
Grader and Tutoring Session Host | Texas Tech University
Computer Architecture
Grader and Tutoring Session Host | Texas Tech University
Education
Texas Tech University
Aug 2014 - May 2021 | Lubbock, TX, USA
Ph.D. in Computer Science
Hebei University of Science and Technology
Sep 2003 - Jun 2007 | Shijiazhuang, Hebei, China
BSc in Computer Science
Skills
Programming Languages
C, Java, Python, Scala, NodeJS, Bash, Rust, C++
Server-side Development
Spring, RPC, AMQP, RESTful web service.
Databases
MySQL, Oracle, SQL Server, NoSQL - Memcache, Redis, HBase, MongoDB, Neo4J
BigData Tools
Spark, Hadoop, MPI
Cloud Computing
AWS experience, Docker
Operating Systems
Linux, Unix, Windows
Software Engineering
Design Patterns, UML, Continuous Integration.
Web Development
HTML, JavaScript, CSS, XML, XSLT, AJAX
Languages
English
Professional working proficiency
Mandarin Chinese
Native