Symposium on Open Source Investigation Labs: AI for Open Source Investigation

[Guillen Torres Sepulveda is an Open Source Investigations Specialist with the Human Rights Center, Berkeley School of Law.

Pınar Yolum is Professor of Trustworthy AI at the Department of Information and Computing Sciences at Utrecht University.]

Open-source investigation is the practice of collecting, verifying, and analyzing publicly available information to answer investigative questions. These questions range from fact-checking to human rights monitoring to the analysis of environmental abuses. Established methods used for open source investigation include geolocation and chrono-location of images or videos, visualization of large volumes of data for analytics purposes, social network analysis of individuals or organizations, and tracking of vessels and flights. Typically, open source investigators perform these methods by collecting, systematizing, and making sense of analog and digital data through structured methodologies. Increasingly, investigators are incorporating various software tools into the process. Even then, however, the amount of available data makes applying these methods time-consuming and error-prone. In recent years, some Artificial Intelligence (AI) techniques have become mature and powerful enough to support investigators in carrying out a few of the more data-heavy tasks. For example, AI techniques can process large amounts of open data (including text, images, video, audio, geospatial data, etc.) much faster than humans, and can help investigators identify and predict patterns effectively. We showcase some examples that may help those interested in exploring the possibilities of AI in investigative work.

AI Techniques for Open Source Investigation

An important task in many open source investigation methods is to identify and match images with content of interest. Checking whether a picture found on social media shows a drone, or whether a video of a protest contains any police officers, are examples of this. Helping investigators with these tasks requires AI tools capable of sophisticated computer vision. Image classification (e.g., checking whether an object appears in a photo), object localization (e.g., identifying where in the image an object is), facial recognition (e.g., matching known human faces to those in a picture), landscape matching (e.g., comparing photo backgrounds to satellite maps), and reverse image search (e.g., matching an image against existing ones to find information) are prime examples of computer vision tasks.

The underlying AI technique for many of these tasks is machine learning (ML), which enables algorithms to learn patterns from data. Simply put, most ML algorithms are first trained on data that have been labeled for the purposes of learning. This enables the ML algorithm to find a pattern based on the labels. When the ML algorithm is then shown a new image, it can predict what the label of that image should be. This technique is especially powerful when the training data is large and broad and the label of interest is well-defined. An example would be identifying a cat in an image: since we can provide a large training set containing images with and without cats, and with cats of different sizes and breeds, the ML algorithm can derive an accurate pattern. Contrast this with identifying a novel weapon, such as a new type of drone, where the training data would be extremely small and the weapon's features might differ from anything previously seen. ML algorithms have been shown to perform poorly in such cases and thus require extensive human oversight when used. The fact that an ML algorithm finds the pattern by itself, without explicit human instruction, has led such algorithms to be called black boxes, meaning that it is difficult to see or check what is happening inside. Recent work on AI explainability aims to overcome this knowledge gap by providing additional methods and tools that support end-users in understanding and judging the actions an AI algorithm takes.
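To make the train-then-predict loop concrete, the sketch below trains a classifier on a small labeled image dataset using the open source scikit-learn library. It is a minimal illustration under simplifying assumptions, not an investigative-grade pipeline; the bundled digits dataset merely stands in for any labeled image collection an investigator might assemble.

```python
# A minimal sketch of supervised image classification with scikit-learn.
# The bundled digits dataset stands in for a labeled investigative corpus;
# in practice the labels might be e.g. "drone" / "no drone".
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 8x8 grayscale images with ground-truth labels
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

model = SVC()               # learn a pattern from the labeled examples
model.fit(X_train, y_train)

# Shown a new, unseen image, the model predicts its label.
print("predicted label:", model.predict(X_test[:1])[0])
print("accuracy on held-out images:", model.score(X_test, y_test))
```

Note how the held-out test set plays the role of "new images the algorithm has never seen": accuracy on it is only meaningful when the training data resembles what the investigator will encounter, which is exactly why novel weapons are hard.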

Another category of tasks in open source investigation requires processing large data sets. Messages collected from Telegram or transcripts of recorded conversations are prime examples. In such situations, it is helpful to be able to summarize text, understand links between messages, or translate them into another language. One class of algorithms analyzes text based on how often a certain term or entity is used, which other words it co-occurs with, and so on. Such algorithms are particularly useful for preprocessing large numbers of messages, helping investigators focus on the right ones, or surfacing how often certain words appear and, with that, which topics are trending in the messages. Another important class is sentiment analysis, where the idea is to determine whether message writers express positive or negative sentiment about a given concept. This can help investigators gauge the sentiment in a population about a topic, prompting further investigation if necessary.
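As a small illustration of both ideas, the sketch below counts term frequencies across a batch of messages and scores each one with the VADER sentiment analyzer from the open source NLTK library. The sample messages are invented for the example; a real pipeline would add tokenization, stop-word removal, and language handling.

```python
# A minimal sketch of term-frequency and sentiment analysis over messages.
# Requires: pip install nltk. The sample messages are hypothetical.
from collections import Counter
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

messages = [
    "Convoy spotted near the border again",
    "The convoy looked peaceful, nothing unusual",
    "Another convoy, this time with armed escorts",
]

# Term frequency: which words dominate the collected messages?
words = [w.lower() for m in messages for w in m.split()]
print(Counter(words).most_common(5))

# Sentiment: is the tone of each message positive or negative?
analyzer = SentimentIntensityAnalyzer()
for m in messages:
    print(analyzer.polarity_scores(m)["compound"], m)
```

The compound score runs from -1 (strongly negative) to +1 (strongly positive), giving investigators a rough, reviewable signal rather than a definitive judgment.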

Examples of AI use for Open Source Investigation

While the use of large datasets and computer-assisted research methods has been part of investigative journalism and human rights fact-finding for decades, in the past ten years, and more prominently in the last five, they have become a staple of open source investigations. Journalistic outlets and human rights investigators (and a few hybrids, such as Forensic Architecture and its multiple spin-outs, or Bellingcat's now defunct Global Justice and Accountability Unit) have hired specialized personnel to combine complex data processing and visualization methods with the usual investigative and narrative methodologies of their fields. The mix has produced high-impact investigations that span many of the possible practical applications of machine learning. One of the earliest examples was BuzzFeed News' investigation into the construction of carceral infrastructure in China: noticing that many facilities shared physical features that could be remotely observed, BuzzFeed's investigators set out to test whether a machine learning model could be trained to find unidentified prisons when fed a large collection of satellite images. The New York Times visual investigations team trained computer vision algorithms to identify craters left by Israeli bombs in Gaza. The LA Times trained another model to work with textual data and identify discrepancies in how crimes were classified, using large datasets of local crimes. The non-governmental organizations VFRAME and Tech 4 Tracing use conflict footage to create weapon detection algorithms and produce ammunition replicas for investigative purposes. Bellingcat has created a tool that allows anyone interested to search across all local council discussions in Britain.

What becomes evident from all of these examples is that AI helps investigators by allowing them to process data at a scale at which humans cannot work, finding patterns or generating insights that are only possible when analyzing massive amounts of data. For example, just as anti-fraud software flags suspicious transactions on the basis of historical records, investigators have used similar pattern-finding models to identify potential spy planes and illegal mining in the Amazon.
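The flagging analogy can be made concrete with an anomaly detector. The sketch below fits scikit-learn's IsolationForest on invented flight-track features and flags a record that deviates from the bulk of the data; the feature names and all numbers are assumptions for illustration, and a flagged record is a lead for human review, not a conclusion.

```python
# A minimal sketch of anomaly flagging with an Isolation Forest.
# The flight features below are invented for illustration only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Ordinary flights: [average altitude (km), circling turns per hour]
normal = rng.normal(loc=[10.0, 0.5], scale=[0.5, 0.3], size=(200, 2))
# A hypothetical surveillance aircraft: low altitude, tight repeated circles.
suspect = np.array([[2.5, 12.0]])

detector = IsolationForest(random_state=0).fit(normal)
print(detector.predict(suspect))     # -1 means flagged as anomalous
print(detector.predict(normal[:3]))  # 1 means consistent with the bulk
```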

Emerging Directions for AI in Open Source Investigation

An interesting direction in which AI could be used more for open source investigation is agent-based modeling (ABM). ABM models the world by creating agents that represent individual entities with various behaviors and interaction patterns. The model can then be simulated to observe how interactions between agents lead collective behavior to emerge over time. ABM has been widely used to help policy developers in the public sector understand how certain policy clauses will affect the behavior of individuals and whether the clauses are likely to be accepted by citizens. For open source investigation, ABM can help in understanding how misinformation spreads in certain networks, as well as the effect of various countermeasures. This would be useful for deciding which countermeasures to invest in and how to put them into practice. Another direction in which ABM can be used is humanitarian crisis monitoring: rather than simply responding to needs, ABM can simulate how humanitarian needs will evolve over time and across affected regions, and help identify actions to take before those needs become acute.
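A toy version of such a simulation is sketched below: agents sit on a random social graph, a false claim spreads from believer to neighbor with some probability, and a fact-checking countermeasure corrects believers at a lower rate. All parameters and the network structure are invented for illustration.

```python
# A minimal agent-based sketch of misinformation spread on a social graph.
# Requires: pip install networkx. All rates below are illustrative guesses.
import random
import networkx as nx

random.seed(0)
graph = nx.erdos_renyi_graph(n=500, p=0.02)  # random "follower" network
believers = {0}  # one agent seeds the false claim

SHARE_PROB = 0.15       # chance a believer convinces a neighbor per step
FACT_CHECK_PROB = 0.05  # countermeasure: chance a believer is corrected

for step in range(20):
    newly_convinced = set()
    for agent in believers:
        for neighbor in graph.neighbors(agent):
            if neighbor not in believers and random.random() < SHARE_PROB:
                newly_convinced.add(neighbor)
    corrected = {a for a in believers if random.random() < FACT_CHECK_PROB}
    believers = (believers | newly_convinced) - corrected
    print(f"step {step:2d}: {len(believers)} agents believe the claim")
```

Rerunning with a higher FACT_CHECK_PROB shows how strongly a countermeasure must act to contain the spread, which is precisely the kind of what-if question ABM is built to explore.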

Another interesting area is the use of generative AI for restoration purposes. Open source images from conflict areas are often low-resolution, making it difficult to make out the background or identify faces. Generative AI can be used to improve the quality of these images and to reconstruct parts of them that have been destroyed. Similarly, blurry satellite images can be cleaned up to make them easier to use in an investigation. Such restoration techniques must, however, be applied transparently to avoid misrepresentation or distortion of evidence.
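As a sketch of what such a workflow might look like, the snippet below upscales a low-resolution frame with an open source diffusion upscaler via the Hugging Face `diffusers` library. The model choice, file names, and prompt are assumptions for illustration, and, in line with the transparency point above, any enhanced output should be disclosed and archived alongside the untouched original.

```python
# A minimal sketch of generative super-resolution with Hugging Face diffusers.
# Model choice, file names, and prompt are illustrative assumptions.
# Requires: pip install diffusers transformers torch pillow, plus a GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("frame_low_res.png").convert("RGB")  # hypothetical input
restored = pipe(prompt="street scene, photograph", image=low_res).images[0]

# Keep the original: transparency requires disclosing what was enhanced.
restored.save("frame_low_res_x4.png")
```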

Ethical and Practical Concerns in the Use of AI for Open Source Investigation

Since ChatGPT and similar tools broke into the mainstream in 2022, many voices have raised concerns about generative AI's impact on people's critical thinking. While earlier generations of AI and machine learning have been spared most of this critique, perhaps due to their more contained adoption, many of the concerns about AI assistants now apply to some of the examples we have provided thus far. In both cases, investigators are outsourcing part of the analysis to computers. A key difference, however, is the level of auditability maintained when involving different types of AI: most of the machine learning applications we have exemplified, complex as they are, can be run multiple times on the same input and produce the same output. This is not the case with generative AI, whose unpredictability keeps it out of any space in which replicability, full transparency, traceability, and trust in provenance are essential, such as open source investigations.

In that sense, investigators are perhaps safer engaging with generative AI only experimentally: to understand its capabilities, to put into proportion the hype that fuels its uncritical adoption, and to lower the barrier to adoption later, when the technology is ripe. Alternatively, non-technical investigators may benefit from using generative AI as a tool to reduce the technical challenge of implementing earlier forms of AI and machine learning (i.e., non-generative AI).

Generative AI can also create new content (e.g., text or images) based on existing content, which leads to deepfakes: fabricated videos or audio of known individuals designed to mislead the public into believing the content is genuine. At the same time, AI tools have been used to detect deepfakes, employing ML algorithms that recognize signs of manipulation or generation. For example, inconsistencies in lighting or misaligned body features have served as tell-tale signs of deepfakes for detection algorithms. The same generative technology can also produce text in the form of realistic-looking formal reports or documents that mislead readers into believing them. This enables individuals to create believable misinformation quickly, and that misinformation spreads just as fast.
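A full deepfake detector is beyond a short example, but one classic manipulation-detection heuristic, error level analysis, can be sketched in a few lines: a JPEG is re-saved at a known quality and compared with the original, and edited regions tend to stand out because they compress differently. This is a toy illustration with a hypothetical file name, not a production detector.

```python
# A toy error-level-analysis (ELA) sketch: re-save a JPEG and diff it.
# Edited regions often compress differently and stand out in the diff.
# Illustrative heuristic only; not a reliable deepfake detector.
from PIL import Image, ImageChops

original = Image.open("suspect_photo.jpg").convert("RGB")  # hypothetical file
original.save("resaved.jpg", quality=90)                   # known quality
resaved = Image.open("resaved.jpg")

ela = ImageChops.difference(original, resaved)
print("per-channel difference extrema:", ela.getextrema())
ela.save("ela_map.png")  # brighter areas warrant closer human inspection
```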

The integration of AI into open source investigation enables investigators to process vast and complex datasets, uncover hidden connections, and generate insights at unprecedented speed and scale. However, technological capability alone is not enough. Ensuring transparency, reproducibility, and ethical oversight remains fundamental to maintaining trust in AI-assisted findings. As investigators increasingly rely on algorithmic tools, combining computational power with human judgment, accountability, and contextual understanding will be key to preserving the integrity and credibility of open source investigations.
