Choosing pseudo-random starting points on each domain

kexej28769@nongnue


The next step was to select domains at random, with a bias toward sites larger than 10,000. When the system selects a site, it then picks a page at random from the top 100 pages that Google has collected for that site. This reduces importance bias a bit further, because we don't always start with the home page. While these pages are important pages on the site, they aren't always the most important page, which is usually the home page. This was the second step in reducing known bias: the lower-quality pages on larger sites balanced out the inherent bias in the Quantcast data.
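A minimal sketch of this selection step, not the actual system. It assumes each domain comes with some numeric size that can serve as a sampling weight, and that the top pages per domain are already available as a list. The names `pick_start_url`, `domains`, and `top_pages` are made up for illustration.

```python
import random

def pick_start_url(domains, top_pages, rng=None):
    """Pick one pseudo-random starting URL.

    domains:   list of (domain, size) pairs; size is a hypothetical weight,
               so larger sites are chosen more often.
    top_pages: dict mapping domain -> list of its top pages (assumed input).
    """
    rng = rng or random.Random()
    names = [name for name, _ in domains]
    weights = [size for _, size in domains]
    # Weighted draw: bias toward larger sites.
    domain = rng.choices(names, weights=weights, k=1)[0]
    # Uniform draw from the top 100 pages, so the home page is not always used.
    return rng.choice(top_pages[domain][:100])
```

Weighting by size is only one way to express "a bias toward larger sites"; the post does not say how strong the bias was.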

4. Crawl, crawl, crawl
And this is where we make our biggest change. We actually crawl the web starting from this set of pseudo-random URLs to generate a real set of random URLs. The idea here is to take all the randomness we built into the pseudo-random URL set and let the crawlers click on links at random to produce a truly random URL set. The crawler picks a random URL from our pseudo-random crawl set and then starts clicking on links at random, with a 10% chance of stopping after each click and a 90% chance of continuing. Wherever the crawler ends up, that final URL is added to our list of random URLs. This is the final set of URLs that we use to run our metrics. We generate about 140,000 unique URLs per month through this process to build our test data set.
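The walk described above can be sketched roughly as follows. This is an illustration under assumptions, not the real crawler: `get_links` is a hypothetical callable that returns the outgoing links of a URL, the walk stops early on a page with no links, and `max_steps` is an added safety cap. The toy graph stands in for actual fetching.

```python
import random

def random_walk(start_url, get_links, stop_prob=0.1, max_steps=1000, rng=None):
    """Follow random links from start_url; return the URL where the walk ends."""
    rng = rng or random.Random()
    url = start_url
    for _ in range(max_steps):       # safety cap (assumption, not in the post)
        links = get_links(url)
        if not links:                # dead end: stop here (assumption)
            break
        url = rng.choice(links)      # "click" a random link
        if rng.random() < stop_prob: # 10% chance of stopping after each click
            break
    return url

def build_random_url_set(seed_urls, get_links, n_walks, rng=None):
    """Run n_walks walks from random seeds and collect the unique end URLs."""
    rng = rng or random.Random()
    final_urls = set()
    for _ in range(n_walks):
        start = rng.choice(seed_urls)
        final_urls.add(random_walk(start, get_links, rng=rng))
    return final_urls

# Toy link graph standing in for the web (made-up URLs).
TOY_GRAPH = {
    "https://a.example/": ["https://a.example/x", "https://b.example/"],
    "https://a.example/x": ["https://b.example/"],
    "https://b.example/": ["https://a.example/"],
}

if __name__ == "__main__":
    urls = build_random_url_set(
        list(TOY_GRAPH), lambda u: TOY_GRAPH.get(u, []), n_walks=50
    )
    print(sorted(urls))
```

With a 10% stop chance after each click, a walk that never hits a dead end lasts 10 clicks on average, so end URLs can land well away from the seed pages.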


Phew, now what? Defining the metrics
Once we had a random set of URLs, we could really start comparing link indexes and measuring their quality, quantity, and speed. Fortunately, in my quest to “get it right,” Moz generously gave me paid access to competing APIs. We started by testing Moz, Majestic, Ahrefs, and SEMRush, but eventually dropped SEMRush after its partnership with Majestic.

So, what questions can we answer now that we have a random sample of the web? This is exactly the wish list I sent in an email to the Link Project leaders at Moz.

Size:
W