Enriched Afghanistan War Diary Topic Map
Latest revision as of 16:15, 26 July 2012
This page is a duplicate of a blog post at the Wandora Forum (http://www.wandora.net/wandora/forum/viewtopic.php?t=56).
Thomas Efer of Topic Maps Labs (http://www.topicmapslab.de) has released a topic map named Afghanistan War Diary - 2004. The topic map is based on documents leaked via WikiLeaks and contains 2000 war reports from the Afghanistan war. Reports are stored in the topic map as unstructured text. The topic map is available at Maiana (http://maiana.topicmapslab.de/u/efi/tm/wd2004), the topic map repository of Topic Maps Labs. Lutz Maicher announced the release of the AWD topic map on the TopicMapMail email list (http://www.infoloom.com/pipermail/topicmapmail/2010q4/008539.html). The announcement evoked an interesting discussion about the quality of the topic map. The discussion inspired me to try enhancing the AWD topic map with Wandora's extractors. Wandora has several extractors that distill entities and keywords out of unstructured text. I selected the following extractors:
- Alchemy Entity extractor
- Alchemy Keywords extractor
- OpenCalais extractor
- Yahoo! term extractor
I then applied each extractor, one by one, to all summary occurrences of the report topics. A summary occurrence contains a textual description of the military event the report covers.
Applying the extractors to the occurrences generated more topics and associations. The output of each extractor was stored in a separate topic map. As a result I got 4 different information packages, one for each extractor. All resulting topic maps and the original topic map were zipped, and the zip package is available at http://www.wandora.org/download/other/AWD2004_experiment.zip. In practice anyone can merge any of the generated topic maps with the original Afghanistan War Diary - 2004 topic map and see the extracted topics interleaved with the original topics. As the original summary occurrence is available, anyone can evaluate whether the extracted topics and associations really describe the report. To summarize the results:
- Yahoo term extractor found 3563 term topics (15092 associations).
- OpenCalais found 1749 tags (3455 associations) in 18 tag classes and 25 topic categories (1317 associations).
- Alchemy entity extractor found 424 entities (1285 associations).
- Alchemy keyword extractor found 9054 keywords (14684 associations).
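One rough way to compare these figures is to look at how many associations each extractor produced per extracted topic. The sketch below only does that arithmetic on the numbers listed above. The variable and function names are my own, the OpenCalais topic categories are left out, and the ratio says nothing about whether the extracted topics are correct.

```python
# Figures copied from the result list above: (extracted topics, associations).
# OpenCalais topic categories (25 categories, 1317 associations) are omitted.
results = {
    "Yahoo term extractor": (3563, 15092),
    "OpenCalais tags": (1749, 3455),
    "Alchemy entity extractor": (424, 1285),
    "Alchemy keyword extractor": (9054, 14684),
}


def associations_per_topic(topics, associations):
    """Average number of associations attached to one extracted topic."""
    return associations / topics


# Ratio per extractor, rounded to two decimals for readability.
ratios = {
    name: round(associations_per_topic(topics, associations), 2)
    for name, (topics, associations) in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio} associations per extracted topic")
```

A higher ratio only means that the same extracted topic was attached to more reports. It is a measure of reuse, not of quality.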
Although the technical implementation of the experiment was easy, I am not sure about the quality of the automatic classifications provided by Calais, Alchemy and Yahoo. The source material, reports of military actions, is clearly very challenging because of its military-specific expressions, terms, and acronyms, and it looks like all the classifiers have made false interpretations. It would be very interesting if someone evaluated the overall quality of the generated topic maps and pointed out typical error classes.
In any case, I hope this demonstration shows that a simple topic map storing merely occurrence data is not necessarily a dead end but a good start.
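To make the workflow concrete, here is a minimal sketch of the idea: run an extractor over each report's summary occurrence, keep the output in its own topic map, and merge it with the original map afterwards. Everything in it is an assumption made for illustration. The topic maps are plain Python dictionaries, the report texts are invented, and toy_acronym_extractor is a stand-in for a real extractor service. It is not Wandora's API and not the code used in the experiment.

```python
import re


def toy_acronym_extractor(text):
    """Hypothetical extractor: treats upper-case acronyms as terms."""
    return sorted(set(re.findall(r"\b[A-Z]{2,}\b", text)))


def extract_to_topic_map(reports, extractor, term_type):
    """Build a separate topic map from the summary text of each report."""
    topics = {}
    associations = set()
    for report_id, summary in reports.items():
        for term in extractor(summary):
            term_id = f"{term_type}/{term}"
            topics[term_id] = {"name": term, "type": term_type}
            # One association links the report to the extracted term.
            associations.add((report_id, term_id))
    return {"topics": topics, "associations": associations}


def merge(original, extracted):
    """Merge two toy topic maps; topics with the same id are unified."""
    merged_topics = dict(original["topics"])
    for topic_id, topic in extracted["topics"].items():
        merged_topics.setdefault(topic_id, topic)
    merged_associations = set(original["associations"]) | set(extracted["associations"])
    return {"topics": merged_topics, "associations": merged_associations}


# Invented sample data, standing in for report topics and their summaries.
reports = {
    "report/1": "IED found by ANP near the FOB",
    "report/2": "ANP patrol reported small arms fire",
}
original_map = {
    "topics": {rid: {"name": rid, "type": "report"} for rid in reports},
    "associations": set(),
}

extracted_map = extract_to_topic_map(reports, toy_acronym_extractor, "acronym")
merged_map = merge(original_map, extracted_map)
print(len(merged_map["topics"]), "topics,", len(merged_map["associations"]), "associations")
```

Keeping each extractor's output in a separate map, as in the experiment, means a poor extractor can be dropped later without touching the original reports.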