Hangzhou, Zhejiang, China
Offline Web Content Integration – Thumbnail System
1. Developed a thumbnail subsystem for extracting thumbnail information from web pages for different vertical searches.
2. Reduced database storage by 40% and crawler resource usage by 20% by implementing a URL deduplication algorithm. Sped up the algorithm 20x by filtering the top duplicate hosts and implementing it in PySpark.
3. Completed the system's cleanup and refresh process, using sm-db as the database and sm-stream for data streaming; enabled automatic alerts that detect inconsistencies between the DB and OSS cloud storage during cleanup. Web pages now receive updated thumbnail sub-contents with a delay of less than 2 hours.
4. Implemented multi-business control, logging, and data statistics for the selection process of the thumbnail subsystem.
5. Built a dashboard in MaxCompute covering crawl-delay metrics and business statistics to monitor the health of the system.
Offline Web Content Integration – Quark Medical Data Process
1. Migrated the Quark Medical database ingestion process from local servers to the cloud.
2. Used Blink for stream computing, Lindorm as the database, and DataX to build the ETL pipeline. Wrote Blink plugins in Java to extract medical atlas entities and relations from session data between patients and the Quark AI doctor.
3. Applied techniques such as session windows and streaming joins to optimize the database ingestion system, which now processes up to 1000 QPS, a 50% improvement over the previous system.
Political Sensitivity Analysis
1. Built an ensemble of fastText and BERT models for classifying politically sensitive web pages, increasing recall by 20%. Wrote a C++ plugin based on the model and deployed it to the online pipeline, where it proved successful.