Lark Wiki Crawler
Tasks
In this article, I will share my journey in designing and building a multi-threaded crawler in Go, used to scrape and aggregate TikTok's internal documentation. (Sensitive internal information is omitted; what remains is the design thought process, plus some code and code logic here and there.)
Firstly, let me briefly explain the problem space. I was given a root document, and inside the root document there is a list of child documents connected to it. The overall structure of the document space mimics a nested folder structure (a tree), and I need to extract the metadata from each document in that tree and export it, ideally to a CSV.
Taking a first look at this problem, it seems like any other regular scraping job, right? We could simply drive a headless browser with Puppeteer (JavaScript), or parse pages with Beautiful Soup (Python), and crawl away.
Here begins my first failed attempt at writing the scraper:
First Attempt: Headless browser
This solution is relatively straightforward. Each document has a "more info" button, which has the same id across all document pages. I just need to mimic a human clicking the button, extract all the necessary information from the popup modal, and store it in a CSV.
Then, I can find the section of the page that contains the links to the documents connected to the current one. Each child doc gets appended to a list for future consumption.
From here, we can simply run a BFS or DFS over the doc list until nothing is left, which indicates that we have processed all documents.
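In outline (language-agnostic, but sketched in Go to match the rest of this post), the traversal I had in mind looks like this; `Doc`, `extractMetadata`, and `childDocs` are hypothetical stand-ins for whatever the scraper actually pulls off the page:
```go
// BFS over the document tree with a queue and a visited set.
// Doc, extractMetadata and childDocs are placeholders, not real APIs.
package main

type Doc struct{ Token string }

// extractMetadata would click "more info", read the modal, and write a CSV row.
func extractMetadata(d Doc) {}

// childDocs would return the child documents linked from the current page.
func childDocs(d Doc) []Doc { return nil }

func crawl(root Doc) {
    queue := []Doc{root}
    visited := map[string]bool{}

    for len(queue) > 0 {
        doc := queue[0]
        queue = queue[1:]
        if visited[doc.Token] {
            continue
        }
        visited[doc.Token] = true

        extractMetadata(doc)
        queue = append(queue, childDocs(doc)...)
    }
}

func main() { crawl(Doc{Token: "root"}) }
```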
But this is not as easy as it seems:
Roadblock 1:
When I started crawling with Puppeteer, I realised that the website (TikTok's internal docs) requires authentication, so every run of the script required scanning a QR code to log in.
This significantly slowed down development, so I had to dive into Puppeteer's documentation to find a solution.
It turns out the solution is straightforward: open a separate terminal, create a new browser instance, and keep that instance running in the background. We can then take the instance's WebSocket endpoint and connect to it from the scraper:
// Browser instance: launch a visible Chrome once, scan the QR code to log in,
// and leave it running in the background.
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({
    executablePath, // path to the local Chrome binary, defined elsewhere
    headless: false,
    args: ["--start-maximized"],
  });
  console.log("Browser wsEndpoint is: ", browser.wsEndpoint());
})();
// The scraper instance: connect to the already-authenticated browser.
const browserWSEndpoint = "ws://127.0.0.1:61090/devtools/browser/9b2f06b3-874c-4941-a1b7-b76cebdeca97"; // copied from the log above
const defaultViewport = {
  width: 1600,
  height: 900,
};
const scrape = async () => {
  try {
    const browser = await puppeteer.connect({
      browserWSEndpoint,
      defaultViewport,
    });
    // ... crawling logic goes here ...
  } catch (err) {
    console.error(err);
  }
};
OK, now that this overhead is out of the way, we can keep crawling. But another issue soon surfaces.
Roadblock 2:
After spending some time trying to locate the HTML section of a document that contains its children, I found that most documents are disconnected: their child documents do not appear anywhere on the page. (It's like Notion, where a page is made up of different blocks, and most pages simply do not have a block listing their child documents.)
Then I realised that instead of crawling the page itself, I can crawl the page tree in the left sidebar.
![[qDB9BLufTee39t3JI35icFIy28s.avif]]
This suddenly becomes a very easy list traversal, until I realised that the list is dynamically rendered: there is no way to traverse it with a fixed index, because the indices keep changing as documents scroll out of sight and new ones are rendered in.
Therefore, I decided to use a set to record a unique identifier for each document, and skip over documents that have already been processed.
try {
  // ...
  const page = (await browser.pages())[0];
  const processedTitles = new Set(); // To track processed items
  const csvWriter = ...
  while (true) {
    // Re-fetch the menu items (the sidebar re-renders as it scrolls)
    const navs = await page.$$('.list-item-wrapper');
    let i = START_INDEX;
    while (i < navs.length) {
      const title = await navs[i].$eval('.tree-node-wrapper', (ele) => ele.innerText);
      // Skip already processed items
      if (processedTitles.has(title)) {
        i++;
      } else {
        break;
      }
    }
    // Process the new documents and titles
    // Then, scroll to the next section of the side menu
  }
};
This is somewhat inefficient, since every pass has to loop through the already-processed items. But even without this inefficiency, given the scale of documents to scrape, Puppeteer's way of mimicking human browser interaction would significantly slow down the process.
Hence, I had to find a quicker way.
Second Attempt: HTTP-based crawler
After some research, I realised that making direct HTTP calls with cookies would be much more efficient; we just have to process the JSON responses to obtain the useful information.
This job can therefore be handled by a simple HTTP crawler in Go.
From a brief inspection of the network activity, I found two GET requests: one returns all the relevant information about a document, and the other returns all of a document's children together with their tokens. The only required parameters are the cookie and the document token (which appears directly in the link).
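The real endpoints and cookie values obviously cannot be shown here, so below is a minimal Go sketch with made-up URL paths and response fields, just to illustrate the shape of the two calls (the later sketches in this post build on these helpers):
```go
// Two GET helpers against hypothetical endpoints; the URL paths, response fields
// and cookie handling are assumptions standing in for the real internal API.
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// DocInfo holds the metadata we want for one document (fields are illustrative).
type DocInfo struct {
    Title string `json:"title"`
    Owner string `json:"owner"`
}

// Child is one entry from the children endpoint (fields are illustrative).
type Child struct {
    Token string `json:"token"`
    Title string `json:"title"`
}

// getJSON performs an authenticated GET and decodes the JSON body into out.
func getJSON(url, cookie string, out any) error {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    req.Header.Set("Cookie", cookie) // session cookie copied from the browser
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
    }
    return json.NewDecoder(resp.Body).Decode(out)
}

// fetchDocInfo returns the metadata of a single document.
func fetchDocInfo(token, cookie string) (DocInfo, error) {
    var info DocInfo
    err := getJSON("https://wiki.example.internal/api/doc/"+token, cookie, &info)
    return info, err
}

// fetchChildren returns the child documents (and their tokens) of a document.
func fetchChildren(token, cookie string) ([]Child, error) {
    var children []Child
    err := getJSON("https://wiki.example.internal/api/doc/"+token+"/children", cookie, &children)
    return children, err
}
```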
Then, a simple DFS traversal of all the documents, starting from the root, can capture all of the required information in a CSV.
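Continuing the sketch above (and adding `encoding/csv` to its imports), the single-threaded DFS might look roughly like this; the column layout is just an example:
```go
// Depth-first traversal from the root token, writing one CSV row per document.
// Builds on the hypothetical fetchDocInfo/fetchChildren helpers sketched above.
func crawlDFS(token, cookie string, w *csv.Writer, visited map[string]bool) error {
    if visited[token] {
        return nil
    }
    visited[token] = true

    info, err := fetchDocInfo(token, cookie)
    if err != nil {
        return err
    }
    if err := w.Write([]string{token, info.Title, info.Owner}); err != nil {
        return err
    }

    children, err := fetchChildren(token, cookie)
    if err != nil {
        return err
    }
    for _, c := range children {
        if err := crawlDFS(c.Token, cookie, w, visited); err != nil {
            return err
        }
    }
    return nil
}
```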
But after I ran the scraper across our internal documents, it took over 50 minutes, which is way over the acceptable time. Ideally, this process should take under 5 minutes.
Since I didn't see a way to reduce the time of an individual HTTP request, there was only one approach left: concurrency.
Final attempt: Crawler with concurrency
Now we get to the most interesting part of this blog.
However, in order to build a concurrent HTTP crawler, I first had to understand how to traverse a tree concurrently. The solution should also use a limited number of threads/workers, so that the program does not exceed theoretical and practical limits (such as CPU), and it should cap the number of HTTP requests made per second to avoid rate limiting or being flagged as a threat by the internal documentation site.
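As a side note, the requests-per-second cap is easy to enforce with a shared ticker that every worker must receive from before firing a request. A tiny standalone illustration (the ~10 requests per second rate is an assumption, not the real limit):
```go
// Throttling with a shared ticker: each receive on limiter.C blocks until the
// next tick, so at most ~10 "requests" fire per second. The rate is illustrative.
package main

import (
    "fmt"
    "time"
)

func main() {
    limiter := time.NewTicker(100 * time.Millisecond) // ~10 requests per second
    defer limiter.Stop()

    for i := 0; i < 5; i++ {
        <-limiter.C // wait for the next tick before "sending" a request
        fmt.Println("request", i, "fired at", time.Now().Format(time.StampMilli))
    }
}
```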
After some initial thought, two approaches to the traversal came to mind:
- Divide-and-conquer traversal: split the tree into subtrees, assign one to each thread, and collect the results at the end.
- Breadth-first search: start from the root node and traverse the tree layer by layer, continuously adding nodes to a thread-safe queue while consuming from that same queue.
The first approach is more complicated, since I would need a method for dividing the tree into subtrees of similar size, or I would have to implement work stealing across threads (that is, whenever a worker becomes idle, it "steals" work from the others).
So I went with the second approach. Here is the structure of my concurrent scraper in Go:
- Keep a thread-safe queue (a channel in Go) to store the nodes/documents to be processed later.
- Have 10 workers listening on the queue, each taking a node and processing it whenever the queue is not empty.
- After a worker has processed a node and recorded its info, it appends all of the node's children back onto the queue. (The CSV buffer is also thread-safe, so multiple workers can append to it.) A sketch of this structure follows below.
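Here is a stripped-down sketch of that structure. It reuses the hypothetical `fetchDocInfo`/`fetchChildren` helpers from earlier (adding `sync` and `encoding/csv` to the imports); the worker count and CSV columns are illustrative:
```go
const numWorkers = 10

// node is one unit of work: a document token waiting to be crawled.
type node struct {
    Token string
}

// safeWriter guards the CSV writer with a mutex so every worker can append rows.
type safeWriter struct {
    mu sync.Mutex
    w  *csv.Writer
}

func (s *safeWriter) writeRow(row []string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.w.Write(row)
}

// processNode fetches one document, records its metadata, and feeds its
// children back into the queue for other workers to pick up.
func processNode(n node, queue chan<- node, out *safeWriter, cookie string, seen *sync.Map) {
    if _, loaded := seen.LoadOrStore(n.Token, true); loaded {
        return // another worker already handled this document
    }
    info, err := fetchDocInfo(n.Token, cookie)
    if err != nil {
        return // the real crawler logs and retries here
    }
    _ = out.writeRow([]string{n.Token, info.Title, info.Owner})

    children, err := fetchChildren(n.Token, cookie)
    if err != nil {
        return
    }
    for _, c := range children {
        queue <- node{Token: c.Token} // relies on a generously buffered channel
    }
}

// worker drains the queue forever: nothing ever closes the channel, which is
// exactly the shutdown problem discussed below.
func worker(queue chan node, out *safeWriter, cookie string, seen *sync.Map) {
    for n := range queue {
        processNode(n, queue, out, cookie, seen)
    }
}
```
The `sync.Map` plays the same role as the title set in the Puppeteer version: it stops two workers from processing the same document twice, and the generously buffered channel lets workers enqueue children without blocking each other.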
Now the program completes in around 5 minutes with 10 workers, which is efficient enough.
Although the crawl itself finishes, the workers keep idling on the empty queue indefinitely once every document has been processed. I had to figure out a way to exit the program gracefully.
Initially, I thought I could exit the program once the queue is empty and all workers are idle.
However, this can cause an early shutdown if not carefully managed. For instance:
- Worker A pulls a node from the queue.
- Worker A processes the node, finds child nodes, and is about to push them into the queue.
- The main routine checks the queue and sees it’s empty in that split second, sees all workers are “idle” (or at least not currently blocking), and shuts down.
- The child nodes never get enqueued, and the program terminates prematurely.
Therefore, I implemented a workaround: set a time interval up front, and if every worker has been idle for longer than that interval, exit gracefully.
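A sketch of that workaround, built on top of the worker-pool sketch above (adding `os` and `time` to the imports): each worker gives up after sitting idle for a fixed interval, and the main goroutine simply waits for all of them to return. The timeout value, the queue buffer size, `ROOT_TOKEN`, and the `DOC_COOKIE` environment variable are all placeholders:
```go
// Idle-timeout shutdown: a worker exits once nothing has arrived on the queue
// for idleTimeout, so the program ends shortly after the last document is done.
const idleTimeout = 30 * time.Second // illustrative value

func workerWithTimeout(queue chan node, out *safeWriter, cookie string, seen *sync.Map, wg *sync.WaitGroup) {
    defer wg.Done()
    for {
        select {
        case n := <-queue:
            processNode(n, queue, out, cookie, seen) // same body as the worker above
        case <-time.After(idleTimeout):
            return // nothing arrived for the whole interval; assume the tree is exhausted
        }
    }
}

func main() {
    queue := make(chan node, 100000) // large buffer so enqueueing children never blocks
    seen := &sync.Map{}
    out := &safeWriter{w: csv.NewWriter(os.Stdout)}
    cookie := os.Getenv("DOC_COOKIE") // session cookie supplied from outside

    queue <- node{Token: "ROOT_TOKEN"} // placeholder for the real root document token

    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go workerWithTimeout(queue, out, cookie, seen, &wg)
    }
    wg.Wait() // returns only after every worker has been idle for idleTimeout
    out.w.Flush()
}
```
As long as the interval comfortably exceeds the time it takes to process a single document, the race described above cannot trigger an early shutdown: a child enqueued a moment after the queue looked empty will still be picked up long before any worker's timeout fires.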