The Information Laundromat

About

The following sections provide information on how the Information Laundromat tool works, how to use it effectively, and how to interpret results.

The Laundromat

The Laundromat tool provides two core functions: Content Similarity Search and Metadata Similarity Search.

Content Similarity Search takes a user-selected URL, title, and/or text snippet and uses GDELT, a variety of search services, and a plagiarism checker (provided the total queried text is longer than 15 words) to detect URLs whose content has some degree of similarity with the queried content. The user may also specify a country and language to search in. Not every service supports every language and country; where a selection is unsupported, the search defaults to the United States and English. Finally, users may specify which search engines/services they want to use for their search.

Content Similarity Search attempts to find similar articles or text across the open web. It does not provide evidence of where that text originated or any relationship between two entities posting two similar texts. Determination of a given text's provenance is outside the scope of this tool.

URL Search

Enter the full URL of an article or webpage (e.g. https://tech.cnn.com/article-title.html or https://www.rt.com/russia/588284-darkening-prospects-ukraine-postwar/) to automatically extract title and content. This feature will not work with every website.

Advanced (Title/Content) Search

This search allows users to specify the title and content separately and to combine them with Boolean AND/OR operators. It also requires specifying a country and language to search in. Not every service supports every language and country; where a selection is unsupported, the search defaults to the United States and English. Finally, users may specify which search engines they want to use for their search.

Batch Search

To search multiple URLs at once, the Laundromat allows users to upload a list of URLs in CSV format. To access this feature, contact us at info [at] securingdemocracy.org to obtain a registration code.

Interpreting Results

A content search produces a searchable list of links, their domains, possible associations with known lists (see below for more information), the title and text snippet, the search engines where each link was found, and the percentage of the title or snippet that matches the provided input.

Because this method relies on search results, some surfaced articles share similarities with the queried text but are fundamentally different. To improve the accuracy of results, we use gestalt string matching (also known as Ratcliff/Obershelp pattern recognition), a technique that scores the similarity of two pieces of text (“strings”) based on their common substrings. This technique is useful in cases where a piece of text may have been lightly edited or had words inserted or removed, as often happens with headlines and articles. A score of 100% indicates a complete match between the queried text and a result, while a value of 0% indicates no match.

While this scoring method is very accurate when querying a snippet of text, it is less accurate when querying URLs, because websites often contain sidebars or other page text that differs from the original source even if the article itself is identical. The Information Laundromat tool may therefore produce lower similarity scores when querying URLs than the strength of the match would otherwise suggest.

The accuracy of results also depends on the length and uniqueness of the queried text. Searching for a well-known name or common phrase will likely produce high match scores but poor results. For example, searching “Xi Jinping” will produce many URLs with 100% match scores, but few of them are likely to be relevant. Regardless of match score, we urge users to manually confirm results.
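The scoring described above can be approximated with Python's standard-library difflib.SequenceMatcher, whose algorithm is closely related to Ratcliff/Obershelp "gestalt pattern matching". This is a minimal sketch and not the tool's actual implementation: the function name match_percent, the lowercasing step, and the sample strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def match_percent(query: str, candidate: str) -> float:
    """Return a 0-100 similarity score based on common substrings."""
    # autojunk=False turns off difflib's heuristic that skips very
    # frequent characters in long inputs, so scoring stays predictable.
    matcher = SequenceMatcher(None, query.lower(), candidate.lower(), autojunk=False)
    return round(matcher.ratio() * 100, 1)


# Hypothetical sample texts (not taken from any real search result).
headline = "Darkening prospects for postwar Ukraine"
lightly_edited = "Darkening prospects for a postwar Ukraine"
unrelated = "Local bakery wins regional bread award"
# The same headline surrounded by navigation and sidebar text,
# as might happen when a whole page is extracted from a URL.
page_text = (
    "Home | News | Subscribe "
    + headline
    + " Related stories | Most read | Contact"
)

for label, text in [
    ("identical", headline),
    ("lightly edited", lightly_edited),
    ("with page clutter", page_text),
    ("unrelated", unrelated),
]:
    print(f"{label}: {match_percent(headline, text)}%")
```

In this sketch the lightly edited headline scores close to 100%, while the copy surrounded by page clutter scores much lower even though it contains the full headline. This mirrors the caveat about URL queries.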

Metadata Similarity Search looks for aspects of a website that make it unique, give insight into its architecture/design, or show how it is used/tracked. These indicators are compared across sites, and items with a high degree of similarity are returned to the user as matches. This search feature accepts a list of one or more fully qualified domain names (each domain name must be prefixed with https://). It produces a list of indicators and a list of sites that match (or are extremely similar to) those indicators. Indicators, and thus matches, are broken into the three tiers described below.
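Because each entry must be a fully qualified domain name with an https:// prefix, it can help to clean a list before submitting it. The helper below is one possible way to do that on the user's side, using only Python's standard library. It is an illustrative assumption, not the tool's own validation logic, and the name normalize_domain_input is hypothetical.

```python
from urllib.parse import urlparse


def normalize_domain_input(raw: str) -> str:
    """Return 'https://<hostname>' for a user-supplied entry, or raise ValueError."""
    candidate = raw.strip()
    # Prepend a scheme when the user typed a bare domain name.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    host = parsed.hostname  # lowercased by urlparse; None if missing
    # Require a web scheme and at least one dot in the hostname.
    if parsed.scheme not in ("http", "https") or not host or "." not in host:
        raise ValueError(f"not a fully qualified domain name: {raw!r}")
    # Drop any path, query, or port and keep only the domain.
    return f"https://{host}"


# Hypothetical inputs for illustration.
for entry in ["example.org", " https://www.rt.com/russia/ ", "Example.ORG"]:
    print(normalize_domain_input(entry))
```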

About the Indicator Tier System and Interpreting Results

Each indicator is associated with an evidentiary tier and is subject to interpretation.

Tier 1 Indicators: WHEN VALID, these are typically unique or highly indicative of the provenance of a website. This tier includes unique IDs for verification purposes and for web services like Google, Yandex, etc., as well as site metadata like WHOIS information and certificates. The WHEN VALID caveat matters because DDOS protection services like Cloudflare and shared hosting services like Bluehost can produce spurious matches.

Tier 2 Indicators: WHEN VALID, these offer a moderate level of certainty regarding the provenance of a website. These are not as unique as Tier 1 indicators but provide valuable context. This tier includes IPs within the same subnet, matching meta tags, and commonalities in standard and custom response headers.

Tier 3 Indicators: WHEN VALID, these are the least specific but can still support broader analyses when combined with higher-tier indicators. These include shared CSS classes, UUIDs, and Content Management Systems.
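As an illustration of one Tier 2 indicator, the check for "IPs within the same subnet" could look something like the sketch below, which uses Python's standard ipaddress module. The /24 prefix length and the function name same_subnet are assumptions; the subnet size the tool actually uses is not stated here.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_len: int = 24) -> bool:
    """True when ip_b falls inside the /prefix_len network containing ip_a."""
    # strict=False lets us build the network from a host address
    # (e.g. 192.0.2.10/24 becomes the network 192.0.2.0/24).
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


# Addresses from ranges reserved for documentation examples.
print(same_subnet("192.0.2.10", "192.0.2.200"))   # same /24
print(same_subnet("192.0.2.10", "198.51.100.7"))  # different networks
```

A wider prefix (for example prefix_len=16) groups more addresses together and will therefore produce more, but weaker, matches.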

Interpreting Indicator Validity

Understanding the validity of indicators is crucial in the analysis of websites' provenance and connections. Indicators can range from high-confidence markers of direct relationships to spurious matches that may mislead investigations. It is essential to approach each indicator with a critical eye and corroborate findings with additional evidence.

High Confidence Indicators:

  • Unique IDs for verification purposes: These are often excellent evidence of a connection or shared ownership, such as unique Google Analytics IDs that directly link websites to the same account.
  • Domain Certificate sharing: When websites share a specific SSL certificate, it often (but not always, see below) indicates a direct relationship, as certificates are typically issued to and managed by the same entity.

Discovering two websites with the same unique Google Analytics ID AND a shared, specific SSL certificate suggests a high-confidence link, indicating shared management or ownership.
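To make the shared-ID idea concrete, the sketch below pulls Google Analytics-style IDs out of page source and reports any ID that appears on more than one domain. It is a simplified illustration, not the tool's extraction code: the regular expression covers only the common "UA-" and "G-" formats, and the function names, domains, and HTML snippets are hypothetical.

```python
import re

# Matches IDs such as UA-12345678-1 or G-ABC123XYZ9. Real pages may
# embed IDs in other ways that this simple pattern will miss.
GA_ID_PATTERN = re.compile(r"\b(UA-\d{4,10}-\d{1,4}|G-[A-Z0-9]{6,12})\b")


def extract_ga_ids(html: str) -> set[str]:
    """Return every analytics-style ID found in the page source."""
    return set(GA_ID_PATTERN.findall(html))


def shared_ids(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each ID seen on two or more domains to the domains using it."""
    seen: dict[str, set[str]] = {}
    for domain, html in pages.items():
        for ga_id in extract_ga_ids(html):
            seen.setdefault(ga_id, set()).add(domain)
    return {ga_id: domains for ga_id, domains in seen.items() if len(domains) > 1}


# Hypothetical page sources keyed by domain.
pages = {
    "site-a.example": "<script>gtag('config', 'UA-12345678-1');</script>",
    "site-b.example": "<script>ga('create', 'UA-12345678-1', 'auto');</script>",
    "site-c.example": "<script>gtag('config', 'G-ABC123XYZ9');</script>",
}

for ga_id, domains in shared_ids(pages).items():
    print(ga_id, sorted(domains))
```

A shared ID found this way is a lead to corroborate, for instance against certificate data, rather than proof of common ownership on its own.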

Spurious Matches:

  • Using services like Cloudflare: While Cloudflare and similar DDOS protection services offer valuable security benefits, they also mask true IP addresses and distribute shared SSL certificates across multiple sites. This can lead to false positives in linking unrelated websites based on shared IP addresses or certificates.
  • Shared hosting services: Websites hosted on shared services like Bluehost may share IP addresses with hundreds of unrelated sites, making IP-based matches unreliable without further context.

Identifying that multiple websites are behind Cloudflare does not inherently indicate a connection beyond the choice of a common, popular service for performance and security enhancements. All Tier 1 and Tier 2 indicators should be scrutinized carefully to determine whether a match is valid or spurious.
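One way an analyst might triage matches along these lines is to set aside infrastructure-based matches whose values point at a shared service. The sketch below is a rough heuristic under stated assumptions: the Match structure, the indicator names ("certificate", "ip_owner", "ga_id"), the marker list, and the sample values are all hypothetical, and a flagged match still needs manual review.

```python
from dataclasses import dataclass

# Hypothetical markers of shared infrastructure. A real list would be
# longer and maintained by the analyst.
SHARED_SERVICE_MARKERS = ("cloudflare", "bluehost")

# Indicator types that shared services commonly make unreliable.
INFRASTRUCTURE_INDICATORS = {"certificate", "ip_owner"}


@dataclass(frozen=True)
class Match:
    indicator: str  # hypothetical indicator name, e.g. "certificate"
    value: str      # the matched value
    tier: int       # evidentiary tier (1, 2, or 3)


def is_likely_spurious(match: Match) -> bool:
    """Flag infrastructure matches whose value names a shared service."""
    if match.indicator not in INFRASTRUCTURE_INDICATORS:
        return False
    value = match.value.lower()
    return any(marker in value for marker in SHARED_SERVICE_MARKERS)


def triage(matches: list[Match]) -> tuple[list[Match], list[Match]]:
    """Split matches into (kept, flagged for manual review)."""
    kept = [m for m in matches if not is_likely_spurious(m)]
    flagged = [m for m in matches if is_likely_spurious(m)]
    return kept, flagged


# Illustrative sample data.
matches = [
    Match("ga_id", "UA-12345678-1", tier=1),
    Match("certificate", "sni.cloudflaressl.com", tier=1),
    Match("ip_owner", "Bluehost shared hosting", tier=2),
]

kept, flagged = triage(matches)
print("kept:", [m.indicator for m in kept])
print("flagged:", [m.indicator for m in flagged])
```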

Example Investigation:

An analyst investigating a network of disinformation websites notices that several sites share a specific Facebook Pixel ID, indicating a potential link in their online marketing strategies. This Tier 1 indicator suggests a high-confidence connection. However, upon further investigation, it's revealed that these sites also use Cloudflare for DDOS protection, sharing SSL certificates and IP addresses with numerous unrelated sites. While the shared Facebook Pixel ID remains a strong indicator of connection, the shared certificates and IP addresses through Cloudflare are deemed spurious matches and the additional sites are discarded from the network. The analyst corroborates the initial finding with additional Tier 1 indicators, such as unique verification IDs, solidifying the connection between the sites beyond the spurious matches introduced by shared security services.

In interpreting indicator validity, analysts must weigh the evidence, seek corroboration, and consider the broader context to distinguish between high-confidence connections and potentially misleading, spurious matches.

The Domain Forensics Comparison Corpus

Any URLs entered into the Metadata Similarity Search tool are compared against a list of domains already processed by the tool. This corpus is drawn from a number of sources.

Inclusion in the corpus of comparison sites is neither an endorsement nor a criticism of a given website's point of view or its relationship to any other member of the corpus. It solely reflects which websites are of interest to OSINT researchers. If you'd like a website removed from the list or have a potential list of new items to include, email info (at) securingdemocracy.org.

Partners, Sponsors, Disclaimers

The Laundromat Tool is made possible with the support of the European Media and Information Fund (EMIF). The Information Laundromat Tool is built by a partnership of the Alliance for Securing Democracy (ASD), the Institute for Strategic Dialogue (ISD), and the University of Amsterdam (UvA) through the Digital Methods Institute.

Disclaimers

Opinions Disclaimer

The sole responsibility for any content supported by the European Media and Information Fund lies with the author(s) and it may not necessarily reflect the positions of the EMIF and the Fund's Partners, the Calouste Gulbenkian Foundation and the European University Institute.

GDPR Disclaimer

The Information Laundromat tool is committed to protecting and respecting your privacy in compliance with the General Data Protection Regulation (GDPR). This disclaimer outlines the nature of the data processing activities conducted by our tool and your rights as a data subject.

Data Collection and Use

The Information Laundromat tool collects data through two forms, one for each of its functions: Content Similarity Search and Domain Forensics Matching.

  • Content Similarity Search: This function processes URLs and text snippets provided by the user to detect occurrences of the given text across various websites. It is important to note that the provenance of the text and the relationship between entities posting similar texts are not determined by this tool.

  • Domain Forensics Matching: This function processes a domain URL and analyzes aspects of website architecture, design, and usage to identify unique indicators. It compares these indicators across websites to find high degrees of similarity and provides indicators and match results to the user.

Purpose of Processing

The form data and results are collected and used solely for usage analytics and potential corpus expansion.

Data Subject Rights

Under GDPR, you have various rights concerning the processing of your personal data, including:

  • The right to access your personal data.
  • The right to rectification if your data is inaccurate or incomplete.
  • The right to erasure of your data ("the right to be forgotten").
  • The right to restrict processing of your data.
  • The right to data portability.
  • The right to object to data processing.
  • The right to lodge a complaint with a supervisory authority.

Please note that exercising some of these rights may impact the functionality of the tool in relation to your use.

Data Security and Retention

We implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk of the data processing activities. Data is retained only for as long as necessary for the purposes for which it was collected.

Contact Information

For any inquiries or requests regarding your data rights, please contact our data protection officer at pbenzoni (at) gmfus.org.

By using the Information Laundromat tool, you acknowledge that you have read this disclaimer and agree to the processing of your data as described herein. If you do not agree with these terms, please do not use the tool.

This disclaimer is subject to updates and modifications. Users are encouraged to review it periodically.