CIA Turns to Data Mining

The CIA, faced with a daily avalanche of information, is using new “data mining” technology to find useful nuggets within thousands of documents and broadcasts in different languages.

The spy agency must sift through a barrage of information from both classified and unclassified sources in varied formats such as hard text, digital text, imagery, and audio in more than 35 languages.

The Office of Advanced Information Technology (AIT), part of the CIA’s Directorate of Science and Technology, is focused on finding solutions to the “volume challenge.”

“We’re not growing at a fast rate, but the amount of information that comes into this place is growing by leaps and bounds,” Larry Fairchild, AIT director, said in an interview this week in a basement demonstration room at Central Intelligence Agency headquarters.

“How do we give folks technologies so that they are able to handle the big increase in information they’re going to have to deal with on a day-to-day basis?” he said.

One computer tool called “Oasis” can convert audio signals from television and radio broadcasts into text.

To improve transcription accuracy, it can recognize accented English, tell whether a speaker is male or female, and distinguish one male or female voice from another of the same gender.

At the left of the screen of a transcribed broadcast are labels “Male 1,” “Female 1,” “Male 2,” next to sentences.

If one voice is labeled with a name, the computer from then on will put that name on anything else with that same voice.

For example, if a broadcast by Saudi exile Osama bin Laden, whom the CIA considers a major threat to Americans, were transcribed and labeled, the computer would automatically label his voice every time it was detected afterward.
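The labeling behavior described above can be sketched in a few lines of Python. This is only an illustration, not how Oasis is built: it assumes the voice-matching step has already assigned each sentence a generic label such as "Male 1," and the function name `apply_speaker_names` and the sample sentences are hypothetical.

```python
def apply_speaker_names(segments, names):
    """Swap a generic speaker label for a known name wherever one exists.

    segments: list of (label, text) pairs, e.g. ("Male 1", "Good evening.")
    names:    dict mapping a generic label to the name a user gave that voice
    """
    return [(names.get(label, label), text) for label, text in segments]


# Made-up transcript; the labels follow the "Male 1" / "Female 1" style.
segments = [
    ("Male 1", "Good evening."),
    ("Female 1", "Here is the weather."),
    ("Male 1", "Back to the news."),
]

# Once a user names one voice, every sentence with that voice gets the name.
labeled = apply_speaker_names(segments, {"Male 1": "Anchor"})
```

Voices that no one has named keep their generic labels, so "Female 1" is left unchanged in this example.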

MACHINE TRANSLATOR

If the machine transcription appears off, the user can hear the actual broadcast with a mouse click. For example, the demonstration showed a transcription that read “latest danger from hell” but the audio said “latest danger from El Nino.”

The computer cuts the time needed to transcribe a half-hour broadcast to 10 minutes, from as much as 90 minutes for a person working alone, a CIA employee conducting the demonstration said.

The CIA is planning to have Oasis developed for different languages such as Arabic and Chinese.

Oasis also finds terms related in meaning to the words being searched. For example, a broadcast might not mention “terrorism” but might say “car bombing,” which the computer would tag as “terrorism” so that anyone searching for that category would find it.

Currently the CIA’s Foreign Broadcast Information Service is using it in one Asian city and intends to have it in other regions such as the Middle East this year.
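The tagging of “car bombing” under “terrorism” can be illustrated with a small Python sketch. It assumes a hand-built table of terms per category and plain substring matching; the table `CATEGORY_TERMS` and the function `tag_categories` are hypothetical, and the article does not say how Oasis actually relates terms.

```python
# Hypothetical category table; a real system would hold far more terms.
CATEGORY_TERMS = {
    "terrorism": ["terrorism", "car bombing"],
}


def tag_categories(text, category_terms=CATEGORY_TERMS):
    """Return the sorted categories whose terms appear in the text."""
    lowered = text.lower()
    return sorted(
        category
        for category, terms in category_terms.items()
        if any(term in lowered for term in terms)
    )


tags = tag_categories("A car bombing was reported downtown.")
```

A search for the “terrorism” category would then match this sentence even though the word itself never appears in it.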

Another computer tool, “FLUENT,” enables a user to conduct computer searches of documents that are in a language the user does not understand.

The user can put English words into the search field, such as “nuclear weapons,” and documents in languages such as Russian, Chinese and Arabic pop up.

The system will then translate the document, and if it looks useful, the analyst can send it to a human translator for more precision.

Languages that FLUENT can translate into English include Chinese, Korean, Portuguese, Russian, Serbo-Croatian and Ukrainian.

“Data mining” tools are used to extract key pieces of information from a variety of intelligence traffic such as on the flow of illegal drugs and also to keep track of illicit financial transactions.

Tools were developed to help CIA analysts on Iraq, who were asked to analyze the agency’s holdings on Iraqi war crime violations, about 1.2 million documents going back to 1979.

The Text Data Mining tool extracted and indexed all words in the data. If an analyst was asked whether Iraq ever used anthrax as a weapon, for example, the analyst could open the tool and find anthrax in the automatically generated index.

That tool also counts the frequency of word use and can handle various spellings of the same Iraqi names or locations.
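A word index with frequency counts and spelling variants, as described above, might look roughly like the following Python sketch. It is not the CIA's tool: the variant table `VARIANTS`, the function `build_index`, and the three sample documents are all made up for illustration, and real name matching would need more than a lookup table.

```python
import re
from collections import Counter, defaultdict

# Hypothetical table mapping a variant spelling to one canonical form.
VARIANTS = {"basrah": "basra"}


def build_index(documents, variants=VARIANTS):
    """Index every word by document and count how often each word occurs."""
    index = defaultdict(set)   # word -> ids of documents containing it
    counts = Counter()         # word -> total occurrences across documents
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            word = variants.get(word, word)  # fold variant spellings together
            index[word].add(doc_id)
            counts[word] += 1
    return index, counts


# Made-up sample documents.
documents = {
    "doc1": "Report mentions anthrax stockpiles near Basra.",
    "doc2": "Shipment routed through Basrah; no anthrax found.",
    "doc3": "Financial summary for the quarter.",
}
index, counts = build_index(documents)
```

Looking up `index["anthrax"]` returns the documents that mention the word, and both spellings of the place name are counted under a single entry.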

There is also “gisting technology,” which gives the flavor of a document’s key information in a short paragraph, Fairchild said.

With the latest spy furor in the nation’s capital, would any of the tools help catch a spy?

“Yes, some of the things we’re doing can,” Fairchild said without details. “We’re looking at better technologies to put in that area,” he added.

Another intelligence official, on condition of anonymity, said: “If they have this kind of technology to plumb the depths of open sources, you can imagine what kind of technologies they have to track down spies.”

Author: Tabassum Zakaria

News Service: Reuters

URL: http://www.washtech.com/news/govtit/8057-1.html