Phishing attacks constitute a significant threat to Internet users, pushing various entities to rely on blocklists to protect against malicious traffic. Unfortunately, the community-driven and automated methods used to construct these lists occasionally produce false positives, erroneously flagging benign domains or URLs as malicious. This blog post discusses how we address this problem.
For the details, we refer readers to our research paper published at the Symposium on Electronic Crime Research (eCrime), Spain, 2023: Building a Resilient Domain Whitelist to Enhance Phishing Blocklist Accuracy.
Blocklists are compiled sets of URLs or domain names associated with malicious activities or exhibiting indicators of malicious behavior. The lists are often sourced from reputable third-party organizations, such as the Anti-Phishing Working Group (APWG) or OpenPhish. Alternatively, they can come from community-driven platforms like PhishTank or be directly integrated into web browsers (e.g., Google Safe Browsing and Microsoft SmartScreen). Due to the large volume of malicious activity on the Internet, such lists are created automatically and may suffer from false positives. Therefore, reliable whitelists are needed to increase the accuracy of blocklist feeds.
The core of our methodology is the list of 8.9 billion candidate domain names that are likely to be classified as malicious in various blocklists.
We first compile a set of 305 labels associated with 338 domains of high-profile brands that are frequently targeted by phishing attacks. Our assumption is that those reference domains are very unlikely to be compromised at the DNS or website levels. Next, we use a tailored version of dnstwist to generate 1.7 million variants of those strings, encompassing homographs, typo-squatting, bitsquatting, homophones, and combo-squatting using 28 selected keywords. Each string is subsequently appended to every top-level and second-level domain found in the Public Suffix List, omitting 86 non-delegated and catch-all suffixes that respond with the NOERROR DNS response code even for non-existing domain names.
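The generation step can be illustrated with a short sketch. It is a simplified stand-in and not the tailored dnstwist version described above: it covers only character omission, adjacent-character transposition, bitsquatting, and combo-squatting, and the function names, the keyword, and the two suffixes in the usage example are assumptions made for illustration.

```python
from itertools import product

# Characters allowed in a hostname label (letters, digits, hyphen).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-")


def omission(label):
    """Typo-squatting: drop one character at a time."""
    return {label[:i] + label[i + 1:] for i in range(len(label))}


def transposition(label):
    """Typo-squatting: swap each pair of adjacent, distinct characters."""
    return {
        label[:i] + label[i + 1] + label[i] + label[i + 2:]
        for i in range(len(label) - 1)
        if label[i] != label[i + 1]
    }


def bitsquatting(label):
    """Flip a single bit in each character; keep valid hostname characters."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in ALLOWED:
                variants.add(label[:i] + flipped + label[i + 1:])
    return variants


def combosquatting(label, keywords):
    """Combine the brand label with each keyword, with and without a hyphen."""
    variants = set()
    for kw in keywords:
        variants.update({label + kw, kw + label, f"{label}-{kw}", f"{kw}-{label}"})
    return variants


def generate_candidates(labels, keywords, suffixes):
    """Append every variant of every label to every suffix."""
    candidates = set()
    for label in labels:
        variants = (
            omission(label)
            | transposition(label)
            | bitsquatting(label)
            | combosquatting(label, keywords)
        )
        # Drop the original label, empty strings, and labels that start
        # or end with a hyphen (not valid hostname labels).
        variants = {
            v for v in variants
            if v and v != label and not v.startswith("-") and not v.endswith("-")
        }
        for variant, suffix in product(variants, suffixes):
            candidates.add(f"{variant}.{suffix}")
    return candidates


# Usage with one label, one keyword, and two suffixes (all illustrative).
candidates = generate_candidates(["paypal"], ["login"], ["com", "co.uk"])
print(len(candidates), "candidate domains")
```

Because every variant is crossed with every suffix, the candidate set grows multiplicatively, which is how a few hundred labels can expand into billions of candidate names.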
Below we discuss the four methods we used to construct a reliable domain name whitelist with 17,215 entries, all of which rely on publicly available datasets. The candidate domains generated for Amazon, Google, Microsoft, Instagram, and Facebook together account for 32.3% of the whole whitelist with 1,377 (8%), 1,190 (6.9%), 1,105 (6.4%), 981 (5.7%), and 911 (5.3%) domains, respectively.
Domains that resemble trademarks or personal names may lead to legal disputes over what is known as “cybersquatting”. To simplify the resolution process for trademark owners and avoid lengthy lawsuits, ICANN introduced the Uniform Domain-Name Dispute-Resolution Policy (UDRP). Trusted legal organizations, such as the World Intellectual Property Organization (WIPO), act as dispute resolution service providers with panelists from different countries. From 4 dispute resolution service providers, we extracted 135,043 unique disputed domains that i) were transferred to the complainant and ii) had a decision date that we could collect. We then narrowed the list down to domains of the high-profile brands identified in the previous step, which added 7,405 entries to the whitelist.
To illustrate this approach, PayPal filed a complaint with the ADR Forum in November 2022 regarding the ppbpaypal.com domain. The panelist ruling resulted in transferring the contested domain to PayPal. At the time of writing, this domain continues to be defensively registered and remains under the supervision of PayPal, making it a valuable addition to our whitelist.
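The UDRP selection can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the `UdrpCase` record, its field names, the outcome strings, and the `.example` domains and dates below are hypothetical and do not reflect the format of any provider's data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class UdrpCase:
    """Hypothetical record for one disputed domain."""
    domain: str
    outcome: str                    # e.g. "transferred", "denied", "cancelled"
    decision_date: Optional[date]   # None if the date could not be collected


def udrp_whitelist(cases, candidate_domains):
    """Keep transferred domains with a known decision date that are candidates."""
    selected = set()
    for case in cases:
        domain = case.domain.lower().rstrip(".")
        if case.outcome.lower() != "transferred":
            continue  # only domains handed over to the complainant
        if case.decision_date is None:
            continue  # decision date is required
        if domain in candidate_domains:
            selected.add(domain)
    return selected


# Usage with made-up cases and candidates.
cases = [
    UdrpCase("brand-login.example", "transferred", date(2023, 1, 15)),
    UdrpCase("brand-pay.example", "denied", date(2023, 2, 1)),
    UdrpCase("brand-shop.example", "transferred", None),
    UdrpCase("unrelated.example", "transferred", date(2023, 3, 3)),
]
candidates = {"brand-login.example", "brand-pay.example", "brand-shop.example"}
print(udrp_whitelist(cases, candidates))
```

Only the first case passes all three checks: the second was not transferred, the third has no decision date, and the fourth is not a candidate domain of a tracked brand.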
We identified nine reputable defensive registrars (MarkMonitor, CSC Corporate Domains, Com Laude, RegistrarSEC, Safenames, Nameshield, IP Twins, SafeBrands, Hogan Lovells) that collaborate with high-profile companies and register domains on their behalf. We then retrieved the WHOIS data for 2.5 million active candidate domains and extracted the registrant organization name. A domain enters the whitelist if its registrant organization in WHOIS corresponds to one of the original values in our collected list of reference domain names, and its registrar's IANA ID and registrar name indicate one of the nine defensive registrars. This method added 3,979 domains to the whitelist.
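The WHOIS check combines two conditions, which the following sketch makes explicit. It assumes WHOIS responses have already been parsed into dictionaries; the field names, the normalization rule, the organization name, and the registrar ID `9001` are placeholders chosen for illustration and are not real values.

```python
def normalize(name):
    """Lowercase and strip punctuation so 'Example Brand, Inc.' matches 'example brand inc'."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def is_defensive_registration(record, reference_orgs, registrars):
    """Return True if both the registrant and the registrar checks pass.

    record         -- parsed WHOIS data (hypothetical field names)
    reference_orgs -- normalized registrant organizations of the reference domains
    registrars     -- mapping of IANA ID to a normalized defensive registrar name
    """
    expected_name = registrars.get(record.get("registrar_iana_id"))
    if expected_name is None:
        return False  # IANA ID is not one of the defensive registrars
    registrar_ok = expected_name in normalize(record.get("registrar", ""))
    registrant_ok = normalize(record.get("registrant_org", "")) in reference_orgs
    return registrar_ok and registrant_ok


# Usage: the ID 9001 and the organization name are placeholders.
registrars = {9001: "markmonitor"}
reference_orgs = {normalize("Example Brand Inc.")}
record = {
    "registrant_org": "Example Brand, Inc.",
    "registrar": "MarkMonitor Inc.",
    "registrar_iana_id": 9001,
}
print(is_defensive_registration(record, reference_orgs, registrars))
```

Requiring both the IANA ID and the registrar name to agree guards against records where one of the two fields is missing or inconsistent.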
Companies may use the same authoritative name servers for their primary, alternative, localized, defensive, and other domains to streamline domain management. We queried the NS records of the 338 brand domains, keeping only those whose name server hostnames were under the corresponding reference domains. We then requested the NS records of all 8.9 billion candidate domains and checked the overlap with the name servers of the reference domains. As a final sanity check, we resolved the candidate domains directly at the overlapping name servers to ensure that these servers are indeed authoritative for the domains that list them. This method contributed 1,579 domains to the whitelist.
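The first two steps of the name server method can be sketched on pre-collected NS data. The sketch reads "under the corresponding reference domain" as the name server hostname being equal to, or a subdomain of, the reference domain, which is an assumption about the exact rule. The final sanity check (querying each overlapping name server directly) is left out, and all names use the reserved `.example` suffix.

```python
def clean(name):
    """Lowercase a DNS name and drop the trailing root dot."""
    return name.lower().rstrip(".")


def is_under(ns_host, domain):
    """True if ns_host equals domain or is a subdomain of it."""
    ns_host, domain = clean(ns_host), clean(domain)
    return ns_host == domain or ns_host.endswith("." + domain)


def brand_nameservers(reference_ns):
    """Map each name server hosted under its own reference domain to that domain."""
    owners = {}
    for domain, ns_hosts in reference_ns.items():
        for ns in ns_hosts:
            if is_under(ns, domain):
                owners[clean(ns)] = clean(domain)
    return owners


def ns_overlap(candidate_ns, owners):
    """Return candidates that share at least one name server with a brand."""
    matches = {}
    for candidate, ns_hosts in candidate_ns.items():
        shared = {clean(ns) for ns in ns_hosts}.intersection(owners)
        if shared:
            matches[clean(candidate)] = sorted(shared)
    return matches


# Usage with made-up NS data.
reference_ns = {
    "brand.example": ["ns1.brand.example.", "ns2.brand.example."],
    "other.example": ["ns1.dns-provider.example."],  # third-party NS: ignored
}
candidate_ns = {
    "brnad.example": ["ns1.brand.example."],
    "brand-login.example": ["ns1.dns-provider.example."],
}
owners = brand_nameservers(reference_ns)
print(ns_overlap(candidate_ns, owners))
```

Discarding third-party name servers matters because a shared DNS provider hosts many unrelated customers, so an overlap there says nothing about who controls the candidate domain.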
Transport Layer Security (TLS) certificates are an additional source of potential whitelist entries. We gathered the certificates for the 338 reference domains with our TLS module and obtained a dataset of 6,946 entries, including 6,071 fully qualified domain names (FQDNs) and 875 wildcards covered by those certificates. Because we cannot assess the significance of every domain covered by a wildcard, we excluded wildcards from the dataset. We then matched the 8.9 billion candidate domains against the list of domains allowed by the certificates of the brand domains. This method contributed 47 domains to the final whitelist.
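The certificate matching reduces to a set lookup once the names covered by the certificates are known. The sketch below assumes those names have already been extracted into a flat list (it does not fetch or parse certificates), and the function names and sample entries are illustrative.

```python
def split_certificate_names(names):
    """Separate exact FQDNs from wildcard entries such as '*.brand.example'."""
    fqdns, wildcards = set(), set()
    for name in names:
        name = name.lower().rstrip(".")
        if name.startswith("*."):
            wildcards.add(name)
        else:
            fqdns.add(name)
    return fqdns, wildcards


def tls_matches(candidates, names):
    """Return candidates that appear verbatim among the non-wildcard names."""
    fqdns, _wildcards = split_certificate_names(names)  # wildcards are excluded
    return {c.lower().rstrip(".") for c in candidates} & fqdns


# Usage with made-up certificate names and candidates.
names = ["brand.example", "brnad.example", "*.brand-cdn.example"]
candidates = ["brnad.example", "img.brand-cdn.example", "brand-login.example"]
print(tls_matches(candidates, names))
```

Here `img.brand-cdn.example` is not matched even though the wildcard would cover it, mirroring the decision to exclude wildcard entries.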
We applied our whitelist to existing blocklists to demonstrate its effectiveness. We gathered URLs and domains that appeared between August 2022 and August 2023 in three phishing blocklists: APWG, OpenPhish, and PhishTank. We then selected the URLs whose FQDNs or registered domains appeared in the whitelist. Overall, we found an overlap of 73 domains with APWG, representing 1.1 million URLs. This large number of URLs comes from a small number of domains that generate a unique URL each time the APWG automated system visits a given page. An example of such a domain is absabank.mu, with over a million blocklisted URLs. This behavior has been observed for phishing web pages that try to evade detection systems by creating a unique URL on each victim's visit. It can also be observed on the official login pages of some financial institutions; when such a benign domain is wrongly classified as malicious and then crawled automatically, the result is a large number of false positive URLs. The number of false positives for the other blocklists is less significant.
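Checking a blocklist feed against the whitelist can be sketched as follows. The registered-domain helper is a deliberately simplified assumption: it takes a small set of public suffixes as input and ignores the wildcard and exception rules of the real Public Suffix List. The URLs, whitelist entries, and suffixes are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def registered_domain(host, public_suffixes):
    """Return the longest matching public suffix plus one label, or None."""
    labels = host.split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in public_suffixes:
            # i == 0 means the host itself is a public suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def whitelisted_hits(urls, whitelist, public_suffixes):
    """Count blocklisted URLs per whitelisted FQDN or registered domain."""
    hits = Counter()
    for url in urls:
        try:
            host = urlsplit(url).hostname  # lowercased by urlsplit
        except ValueError:
            continue  # malformed URL
        if not host:
            continue  # e.g. URL without a scheme
        host = host.rstrip(".")
        if host in whitelist:
            hits[host] += 1
            continue
        registered = registered_domain(host, public_suffixes)
        if registered in whitelist:
            hits[registered] += 1
    return hits


# Usage with made-up data.
suffixes = {"example", "co.example"}
whitelist = {"brand-secure.example", "brand.co.example"}
urls = [
    "https://brand-secure.example/login?session=1",
    "https://brand-secure.example/login?session=2",
    "http://www.brand.co.example/",
    "http://phish.example/brand",
]
print(whitelisted_hits(urls, whitelist, suffixes))
```

Counting per domain rather than per URL makes it visible when a handful of domains, each producing a fresh URL on every visit, account for most of the overlapping URLs.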
We also checked our whitelist against Google Safe Browsing (GSB), which marked 21 domains as a threat. We manually investigated these domains and found that 19 out of 21 had gone through a UDRP dispute process. All of them had already been transferred to the complainant at the time of the measurement, indicating false positives in GSB. Such cases reveal a common problem of blocklists: a domain name is correctly identified as malicious at the time of listing but remains on the blocklist even after it no longer represents a threat, i.e., after it has been transferred to the trademark owner, as in the reported cases.