Modern Technology of Content-based Retrieval
in Digital Visual Archives.

N.S. Baigarova, Y.A. Bukhshtab, N.N. Evteeva
Keldysh Institute of Applied Mathematics, Russian Academy of Sciences
Phone: (095) 250-78-15
E-mail: kikom@glasnet.ru

1. How to provide access to visual archives

Modern museums, libraries and archives increasingly build digital visual archives, including video, in order to preserve the heritage of the past and make it accessible from afar. This calls for new information technologies.

The digitising and storage of large amounts of visual material is no longer a problem from a technical point of view. What we urgently need today is content-based retrieval of relevant information from digital visual archives.

Until recently, search for visual information has traditionally been based on text information associated with images. To present a visual archive, the authors have developed a full-text search engine which is effectively used both on CD and on the Internet. This software gave rise to an electronic film catalogue of the Russian State Archives of Film and Photo Documents, which is updated on a regular basis (the implementation of this project is supported by the Open Society Institute).

Within a system built with these tools, an image (or a film) is presented as a document that can be retrieved by any word it contains. The document structure reflects its context, so that a search can be restricted to one or several components of interest to a user. For example, the components for the film catalogue of the Russian State Archives of Film and Photo Documents comprise: title, authors, date, studio, subject matter, description of content, colour, sound and film parameters, etc. A user receives the requested documents either in full or in a browsing form with a certain selection of description components. A document may also include images or video clips for viewing; a hypothetical record of this kind is sketched below.
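For illustration only, such a document could be modelled as a record of named components along the following lines (the field names and values here are hypothetical, not the actual schema of the archive's catalogue):

film_record = {
    "title": "May Day Parade",          # hypothetical entry
    "authors": ["N. Petrov"],           # hypothetical entry
    "date": "1937",
    "studio": "Central Studio",         # hypothetical entry
    "subject": "documentary chronicle",
    "description": "Parade on Red Square; close shots of the crowd.",
    "colour": "black-and-white",
    "sound": "sound",
    "clips": ["fragment_01.mpg"],       # optional images or video clips
}

# A full-text search may match any word of any component, or be
# restricted to chosen components, e.g. only "description".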

Although access to a visual collection can thus be provided by searching text information associated with images, this approach does not seem fully satisfactory. Inconsistencies between visual content and text description reduce the quality and effectiveness of a search. The information a human being gets from looking at a picture is "worth" thousands of words, and certain images are hard to describe in words at all (abstract pictures immediately come to mind as an example).

A problem therefore arises: how to provide access to modern digital visual archives using a whole set of tools - text descriptions and visual parameters, from the very simplest, such as colour range, to more sophisticated ones, such as domain objects. A text description and the visual information required for a search complement each other, making a comprehensive search possible. A request in this case can be formulated in words, perhaps with an indication of an appropriate context, or in terms of visual parameters. It can also combine key words with a description of visual content. A search can take an iterative form: first a search based on key words (the fastest way), then a resource-intensive search by visual parameters.

Today, scene understanding and object recognition techniques apply only to narrow domains, as efficient all-purpose algorithms are not available. A valuable modern tool in visual information management is the ability to automatically extract and index visual content using a set of visual features (colour histograms, shape parameters, texture measures, and motion characteristics for video) [6, 12, 13, 14].

The research reported here aims to develop and implement techniques to analyse, index and search for images and video data on the basis of their visual attributes, and relates to the automated comprehension of visual information.

2. Visual features and search-by-example

Visual features are image properties that can be computed from digitised visual data; they make effective indexing possible and allow requests to be processed by the visual attributes of an image. A search pattern generated from visual features is small compared with the original image and convenient for search. Retrieval is based on a similarity measure. Search-by-example is the most suitable approach for retrieving images: the user provides an entire image as a template and the system finds images that are visually similar. As the system computes image features, the main task in processing a similarity query is to compare these features and come up with some measure of how they differ. Similarity between the template image and each target image is defined as an overall similarity made up of the similarity measures for all visual features, so target images can be ranked in order of their similarity to the template. A search at this level of abstraction does not involve object identification. If we take as a sample, say, the image of a dog, the system will search for images similar to the sample in colour, layout, certain shapes, etc.; yet there is no guarantee that the image of this particular animal will be found. Nevertheless, today a search-by-example on the basis of visual features seems to be a sufficiently effective and universal means of access to collections of digitised images.
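As a minimal sketch of this ranking step (the feature set, the L1 distance and the weights below are illustrative assumptions, not the authors' actual implementation), the overall similarity can be formed as a weighted sum of per-feature distances:

import numpy as np

def overall_distance(template_feats, target_feats, weights):
    """Combine per-feature distances into one overall measure.

    template_feats / target_feats: dicts mapping a feature name
    (e.g. 'colour', 'shape', 'texture') to a numpy vector;
    weights: dict with the same keys giving relative importance.
    """
    total = 0.0
    for name, w in weights.items():
        # L1 distance between feature vectors of the same kind.
        total += w * np.abs(template_feats[name] - target_feats[name]).sum()
    return total

def rank_by_similarity(template_feats, collection, weights):
    # Smaller overall distance means more similar; sort ascending.
    return sorted(collection,
                  key=lambda feats: overall_distance(template_feats, feats, weights))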

Many researchers, the authors of this paper among them, have accumulated experience in implementing algorithms that calculate and compare visual features.

3. Indexing and search for images

The importance of colour to human perception is hard to overestimate; hence the importance of techniques that search for visual information on the basis of similar colours. Colour histograms are the most popular. Mean and dominant colours, as well as colour sets, can also be used; these features are well suited to indexing image areas locally [9, 17].

The idea of using colour histograms for indexing and comparison of images boils down to the following: the entire variety of colours is split into a set of non-overlapping subsets that completely cover it. A histogram is formed to represent the proportion of each subset in the colour range of a given image. To compare histograms, a distance between them is computed. Various techniques to build and compare colour histograms are available [1, 2, 7, 17]; they differ in the initial colour model, the histogram dimension and the distance. The authors have implemented several versions of this technique, with different splittings of the set of colours into subsets and with different ways of calculating the distance between histograms.

When RGB colours are divided by intensity, the intensity of each colour is determined by its red, green and blue components. The obtained value, somewhere between 0 and 255, falls into one of 16 equal intervals within the range of possible values. The distance between histograms is the sum of absolute differences between corresponding histogram elements. The technique improves somewhat if the distance is calculated by comparing histograms element by element with regard for adjacent elements. This technique is most effective for greyscale images. For colour RGB images better results are obtained with another technique: splitting the RGB colours into right parallelepipeds.
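A minimal sketch of this variant follows, assuming intensity is the mean of the R, G and B components (the exact weighting used is not specified above):

import numpy as np

def intensity_histogram(img_rgb):
    """16-bin histogram of pixel intensities for an RGB image.

    img_rgb: HxWx3 uint8 array. Intensity is taken here as the
    mean of R, G, B (the exact weighting is an assumption).
    """
    intensity = img_rgb.astype(np.float64).mean(axis=2)   # values 0..255
    hist, _ = np.histogram(intensity, bins=16, range=(0, 256))
    return hist / hist.sum()                              # normalise

def l1_distance(h1, h2):
    # Sum of absolute differences between corresponding bins.
    return np.abs(h1 - h2).sum()

def smoothed_l1_distance(h1, h2):
    # Variant that also credits adjacent bins, softening the effect
    # of intensities falling just either side of a bin boundary.
    k = np.array([0.25, 0.5, 0.25])
    s1 = np.convolve(h1, k, mode="same")
    s2 = np.convolve(h2, k, mode="same")
    return np.abs(s1 - s2).sum()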

The RGB colour space is viewed as a three-dimensional cube, with each axis corresponding to one of the three principal colours (red, green or blue) and marked from 0 to 255 (a higher value corresponds to greater colour intensity). Viewed in this manner, any colour of an RGB image can be represented as a point of the cube. To build a colour histogram, each side is divided into 4 equal intervals, so the RGB cube is divided into 64 right parallelepipeds. An image histogram then represents the distribution across these parallelepipeds of the points in RGB space corresponding to the colours of the image pixels. Though very simple, this technique provides good results. A RIA "Novosti" collection of photos on varying subject matter, stored on CD, was employed as test material. Series of pictures similar in colour are successfully identified, provided they are available in the base. For example, if we need a picture in bright red colours, we can take a photo of a scarlet rose as a sample and retrieve from the base photos of a red banner, a tiled roof or a woman in a red dress, etc. In practice it is reasonable to make a preliminary selection by text descriptions, say, by a list of classification headings. Thus, if we need a summer photo with an abundance of greenery, it makes sense to search by a sample photo of a forest or a garden in summertime under the heading "Nature". Otherwise, a search would return a lot of redundant material, e.g. a football field or a billiard table.
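The 64-box histogram just described can be computed along these lines (a sketch; the normalisation is an assumption):

import numpy as np

def rgb_cube_histogram(img_rgb, bins_per_axis=4):
    """Histogram over the RGB cube split into 4 x 4 x 4 = 64 boxes.

    Each axis (0..255) is divided into 4 equal intervals, so every
    pixel colour falls into exactly one of 64 parallelepipeds.
    img_rgb: HxWx3 uint8 array.
    """
    # Map each 8-bit channel value to a bin index 0..3.
    idx = (img_rgb.astype(np.uint16) * bins_per_axis) // 256
    # Linearise the (r, g, b) bin triple into one index 0..63.
    flat = (idx[..., 0] * bins_per_axis + idx[..., 1]) * bins_per_axis + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins_per_axis ** 3)
    return hist / hist.sum()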

Images can be compared more accurately by means of a quad-tree technique, in which colour histograms are calculated and compared for each quarter of an image (each 1/16 part, etc.) rather than for the entire image. Image comparison is then based on a Euclidean distance in the space of distances between the histograms of the image parts. This technique provides semantically different results: images that differ only in the geometry of identically coloured objects are viewed as dissimilar, whereas the previous technique could have identified them as similar. This technique is useful wherever the colour arrangement in a template picture matters to a user; a search for a photo of a sea sunset is an example.
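A sketch of the quad-tree comparison, reusing rgb_cube_histogram from the previous sketch (the depth of the tree and the per-part distance are illustrative choices):

import numpy as np

def quadrant_histograms(img_rgb, levels=2):
    """Colour histograms for the whole image, its 4 quarters, their
    16 sub-quarters, and so on: a quad-tree of histograms."""
    h, w = img_rgb.shape[:2]
    hists = []
    for level in range(levels + 1):
        n = 2 ** level
        for i in range(n):
            for j in range(n):
                part = img_rgb[i * h // n:(i + 1) * h // n,
                               j * w // n:(j + 1) * w // n]
                hists.append(rgb_cube_histogram(part))
    return hists

def quadtree_distance(hists_a, hists_b):
    # Euclidean distance in the space of per-part histogram distances.
    d = np.array([np.abs(a - b).sum() for a, b in zip(hists_a, hists_b)])
    return np.sqrt((d ** 2).sum())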

Of special interest are indexing techniques applied not to an entire image but to its individual parts, since their informative value varies. An image can be spatially segmented automatically, i.e. segments with certain common attributes - similar or very close values of a visual feature - are singled out.

To detect the edges of an object, we pinpoint abrupt differences in colour intensity in an image. To do so, an intensity gradient is calculated at each point of the image and thresholding is applied to detect significant edges. The underlying method is the Sobel operator [18]. Thresholding is one of the key issues. We arrange this process taking into account both a global threshold, equal to the mean gradient magnitude over the image, and a local threshold - the mean gradient magnitude in a small area around the point under review.
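A sketch of this edge detector using scipy's Sobel filters (the neighbourhood size and the way the two thresholds are combined are assumptions):

import numpy as np
from scipy import ndimage

def edge_map(gray):
    """Edge detection via Sobel gradients with combined global and
    local thresholding, as described above. gray: 2-D array."""
    g = gray.astype(np.float64)
    gx = ndimage.sobel(g, axis=1)
    gy = ndimage.sobel(g, axis=0)
    mag = np.hypot(gx, gy)                       # gradient magnitude
    global_thr = mag.mean()                      # mean gradient over the image
    local_thr = ndimage.uniform_filter(mag, size=9)  # mean in a small neighbourhood
    # A point is an edge if it exceeds both thresholds
    # (how the two are combined is an assumption here).
    return (mag > global_thr) & (mag > local_thr)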

Processing produces a binary matrix, with ones representing the points at which colour intensity changes significantly and zeros representing the remaining points. To cope with noise and eliminate possible ruptures in contours, morphological operations are used. After this, the ones in the binary matrix represent object boundaries, artificially thickened for the purposes of the previous stage of computation. To single out the points on the outer boundary, the area is traversed progressively along its outer side in a counter-clockwise direction; the process starts and ends at the lower left point of the area. Minor objects are discarded.
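The noise-suppression step might look as follows (a sketch using binary closing; the structuring element and the size threshold for "minor" objects are assumptions):

import numpy as np
from scipy import ndimage

def clean_edges(edges, min_size=50):
    """Suppress noise and bridge small contour ruptures with
    morphological operations, then drop minor objects."""
    closed = ndimage.binary_closing(edges, structure=np.ones((3, 3)))
    labels, n = ndimage.label(closed)
    sizes = ndimage.sum(closed, labels, range(1, n + 1))
    # Keep only connected components of sufficient size.
    keep_labels = 1 + np.flatnonzero(sizes >= min_size)
    return np.isin(labels, keep_labels)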

For the image objects thus distinguished, parameters such as co-ordinates in the image, dimensions, colour characteristics, texture and shape measurements can be determined and indexed.

Object shape has been used successfully in indexing and searching for images [2, 7, 20]. The authors have implemented a method in which 128 points are selected proportionally along an object contour to calculate shape characteristics. Object shapes are indexed by the vector of distances from each selected contour point to the centre of the figure and by the vector of turn angles. Shapes are compared by computing a total distance between the two pairs of vectors: the sum of absolute differences between corresponding elements. The results do not depend on the dimensions and locations of objects and are invariant up to rotation. A program implementing this technique proved adequate in retrieving images containing objects whose shapes are similar to the shape of an object in a sample image. Tests have been made on photos of individual items. It should be noted that search by shape similarity is reasonable when the images of a collection represent objects with clear shapes (e.g. a collection of posters).
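A sketch of such a descriptor (the sampling scheme and normalisation are assumptions; full rotation invariance would additionally require minimising the distance over the starting point of the contour traversal, which is omitted here):

import numpy as np

def shape_descriptor(contour, n_points=128):
    """Shape features from an object contour: distances of 128 evenly
    spaced contour points to the centroid, plus turn angles.

    contour: Mx2 array of (x, y) boundary points in traversal order.
    """
    # Pick 128 points spaced proportionally along the contour.
    idx = np.linspace(0, len(contour) - 1, n_points).astype(int)
    pts = contour[idx].astype(np.float64)
    centre = pts.mean(axis=0)
    radii = np.linalg.norm(pts - centre, axis=1)
    radii /= radii.max()                      # scale invariance
    # Turn angle at each point between incoming and outgoing segments.
    prev_seg = pts - np.roll(pts, 1, axis=0)
    next_seg = np.roll(pts, -1, axis=0) - pts
    raw = (np.arctan2(next_seg[:, 1], next_seg[:, 0])
           - np.arctan2(prev_seg[:, 1], prev_seg[:, 0]))
    angles = np.arctan2(np.sin(raw), np.cos(raw))   # wrap to (-pi, pi]
    return radii, angles

def shape_distance(d1, d2):
    # Sum of absolute differences over both vectors.
    return np.abs(d1[0] - d2[0]).sum() + np.abs(d1[1] - d2[1]).sum()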

4. Indexing and search for video data

The great size of video files is a problem both for efficient search and for speedy retrieval of the relevant information required by a user. Thus it makes sense to index each film not as a single unit but as a sequence of logically independent parts - video segments called shots [13]. The task boils down to the detection of shot boundaries, which can be associated with editing effects, changes in camera positioning, etc. Temporal segmentation can be carried out by means of automated image analysis; appropriate techniques are known [3, 10].

The authors have developed an algorithm based on the comparison of colour histograms of adjacent frames to detect those frames where the image changes significantly. A shot boundary is detected if the distance between the histograms of the current frame and the previous one exceeds an absolute threshold and at the same time is several times greater than the mean histogram difference between adjacent frames, calculated from the beginning of the video segment being detected up to the current frame (relative thresholding).
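A sketch of this boundary detector (the threshold values are illustrative, not the ones actually used):

import numpy as np

def detect_shot_boundaries(frame_hists, abs_thr=0.3, rel_factor=3.0):
    """Shot boundary detection by comparing colour histograms of
    adjacent frames with absolute and relative thresholds.

    frame_hists: list of normalised histograms, one per frame.
    abs_thr and rel_factor are assumed, illustrative values.
    """
    boundaries = []
    diffs_in_shot = []           # histogram differences since the last boundary
    for t in range(1, len(frame_hists)):
        d = np.abs(frame_hists[t] - frame_hists[t - 1]).sum()
        mean_d = np.mean(diffs_in_shot) if diffs_in_shot else 0.0
        if d > abs_thr and (mean_d == 0.0 or d > rel_factor * mean_d):
            boundaries.append(t)
            diffs_in_shot = []   # start accumulating within the new shot
        else:
            diffs_in_shot.append(d)
    return boundaries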

Segmentation results depend on the thresholds. They are set so as to obtain results acceptable from the point of view of reducing the two kinds of error: identifying a false boundary and missing a real one. The current values are picked so that false identifications occur roughly an order of magnitude more often than misses. This stems from the fact that it is impossible to accurately calculate motion parameters in a "double" video segment spanning a missed boundary, and computing them is one of the objectives of temporal segmentation. The programs have been tested on video films from different sources and of different quality. The program does not miss shot boundaries in colour video films or in animated cartoons; in black-and-white films, missed boundaries account for 3-4%. Spurious identifications of shot boundaries in the various video films amount to no more than 12%.

Once the shots are extracted, representative frames are selected for examination. The selection can be very simple [1]: if a shot is shorter than a second, the single central frame is taken; if the shot is longer than one second, one frame per second is chosen. Visual features are calculated for each representative frame: colour histograms, shape and colour characteristics of image objects, texture measures. The techniques are similar to those employed for static images. Further, it is important to index the motion characteristics of the camera/scene, as well as of objects. These characteristics are defined on the basis of the optical flow calculated over all the frames of a video segment [3].
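The simple frame-selection rule of [1] can be stated compactly (assuming a known, constant frame rate):

def representative_frames(shot_start, shot_end, fps=25):
    """Pick representative frames from a shot: the central frame if
    the shot is shorter than a second, otherwise one frame per second.
    Frame indices are half-open: [shot_start, shot_end)."""
    length = shot_end - shot_start
    if length < fps:
        return [shot_start + length // 2]
    return list(range(shot_start, shot_end, fps))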

For a video containing moving objects it is possible to compute the velocity at every point of a frame; various algorithms for computing optical flow are known [8]. The authors apply a differential technique of optical flow calculation based on spatiotemporal derivatives of image intensity. The implemented method [11] rests on two assumptions: a) when an object is in motion, the intensity of its points does not change; b) the velocity field varies smoothly almost everywhere in the image. These assumptions lead to a system of linear equations, for which the authors of the method propose an iterative solution. We apply iterative computation of optical flow over all frames of a video segment to obtain more precise velocity values (one iteration per time step is performed). Due to further development of the original method, we have managed to obtain rather accurate velocity values without distortion of the shape and dimensions of moving objects.
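The basic iteration of the method of [11] is sketched below; this is the textbook two-frame scheme with a fixed number of iterations, not the authors' refined multi-frame version (which performs one iteration per time step):

import numpy as np
from scipy import ndimage

def horn_schunck(img1, img2, alpha=1.0, n_iter=100):
    """Optical flow between two greyscale frames by the differential
    method of [11]: brightness constancy plus a smoothness term,
    solved iteratively."""
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    # Spatiotemporal intensity derivatives (simple estimates).
    ix = ndimage.sobel(img1, axis=1) / 8.0
    iy = ndimage.sobel(img1, axis=0) / 8.0
    it = img2 - img1
    u = np.zeros_like(img1)
    v = np.zeros_like(img1)
    # Neighbourhood-averaging kernel from the original method.
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], dtype=np.float64) / 12.0
    for _ in range(n_iter):
        u_bar = ndimage.convolve(u, avg)
        v_bar = ndimage.convolve(v, avg)
        common = (ix * u_bar + iy * v_bar + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
        u = u_bar - ix * common
        v = v_bar - iy * common
    return u, v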

The complex data obtained after the optical flow has been computed should be put in a simple form suitable for indexing and search. The paper [1] provides one solution to this problem. We propose new characteristics computed from the mean velocities in the quadrants of a video image. The algorithms can compute both global and local video characteristics: they apply to an entire image as well as to each rectangular area containing a moving object.
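Given the flow fields, the quadrant characteristics reduce to simple averages (a sketch):

import numpy as np

def quadrant_mean_velocities(u, v):
    """Mean velocity vector in each of the four image quadrants,
    computed from the optical flow components u (x) and v (y)."""
    h, w = u.shape
    means = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            means.append((u[rows, cols].mean(), v[rows, cols].mean()))
    return means  # [top-left, top-right, bottom-left, bottom-right]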

The program makes it possible to detect moving objects of substantial size on the assumption that the velocities at neighbouring points of an object are similar in direction. For the areas thus distinguished, the smallest enclosing rectangles are examined. Apart from dimensions and location, the mean velocity magnitude is computed over the points of an object (to make it possible to search for video clips where objects move with a required intensity), along with the mean velocity values in the quadrants of the area (to define a motion type). The next stage of the research calls for defining trajectories of movement and the simplest events that happen to objects (spatiotemporal interaction, etc.).

After object processing, global motion characteristics are examined. To this end, the mean velocities in the image quadrants and the mean intensity of scene motion are calculated (the velocities of the separated objects are not taken into account).

Within the framework of this research the authors have suggested a multilevel classification of video by motion type on the basis of mean velocities in image quadrants (a sketch of levels 1 and 4 is given after the list):

  1. Motion scheme identifier

The most similar motion scheme is selected. For each quadrant it defines one of 8 principal directions of movement (with an accuracy of 45°) or the absence of significant motion. The first classification level: fragments with identical motion scheme identifiers are considered similar. Though simple, this technique meets the semantics of many real requests. In particular, typical camera operations can be identified (zoom-in, zoom-out, displacement in the main directions). For example, movement from the image centre towards its corners in all four quadrants corresponds to zooming in.

  2. Dominant direction

The idea consists in dividing the entire variety of films into two groups: those with a pronounced general direction of movement and those without it. From a human point of view, video clips with close shots and a moving central object, or with a moving background, as well as clips made by a moving camera, fall into the first group. All the rest are classified into the second. Naturally, for clips of the first group the direction of motion is essential for indexing, as is the number of quadrants with this direction.

  3. The number of quadrants with velocities far from zero.

  4. Equivalence of schemes up to rotation.

A division into a set of groups is maintained. Each group is formed by rotations of a basic scheme around its axis by 0 (the basic scheme), π/2, π and 3π/2. Semantically, a group corresponds to a single movement shown from different sides, and typical video effects are preserved.

The proposed classification provides comprehensive search for videos. A query can specify for a requested video clip the motion type of a scene or an object - e.g. rotation or forward motion - with the help of a fully or partially defined motion scheme, a certain dominant direction, the number of quadrants with velocities other than 0, and the motion intensity. The concept of scheme equivalence up to rotation makes it possible to specify a relative scheme of motion in a request.
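Levels 1 and 4 of this classification can be sketched as follows (the quadrant ordering, the no-motion threshold and the direction numbering are assumptions; with image coordinates the sense of rotation may differ, but the equivalence classes are the same):

import numpy as np

def motion_scheme(quadrant_means, min_speed=0.5):
    """Motion-scheme identifier: for each quadrant, one of 8 principal
    directions (45-degree accuracy) or 8 meaning 'no significant
    motion'. min_speed is an assumed threshold."""
    scheme = []
    for mu, mv in quadrant_means:
        if np.hypot(mu, mv) < min_speed:
            scheme.append(8)                 # no significant motion
        else:
            angle = np.arctan2(mv, mu) % (2 * np.pi)
            scheme.append(int(np.round(angle / (np.pi / 4))) % 8)
    return tuple(scheme)   # fragments with equal tuples are 'similar'

def rotate_scheme(scheme):
    """Rotate a 4-quadrant scheme by 90 degrees: quadrants are permuted
    and each direction index shifts by two 45-degree steps (index 8,
    'no motion', is unchanged). Quadrant order is assumed to be
    [top-left, top-right, bottom-left, bottom-right]."""
    perm = (1, 3, 0, 2)   # which old quadrant lands in each new position
    return tuple(scheme[p] if scheme[p] == 8 else (scheme[p] + 2) % 8
                 for p in perm)

def canonical_scheme(scheme):
    """Representative of the equivalence class under rotations by
    0, pi/2, pi, 3*pi/2: the lexicographically smallest variant."""
    variants = [scheme]
    for _ in range(3):
        variants.append(rotate_scheme(variants[-1]))
    return min(variants)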

5. Object recognition

A user of a digital library should be able to make a request concerning visual content in terms not only of visual features but also of high-level objects. The presence of objects in an image, as well as their dimensions and location in a frame, is intended to be indexed. Today object recognition is not treated as a global task: it is solved only for narrow domains. As a rule, we have to deal with objects of special interest; for instance, documentary photo and video materials focus on the human being. Hence the authors have been trying to detect (not identify) the frontal view of a human face in images or in video freeze-frames using a neural network. The use of a large number of positive and negative examples for training makes it possible to obtain a sufficiently good model of an object automatically [15, 16, 19].

The system under review can be divided into two subsystems:

1) Pre-processing to eliminate the effect of the light source, and a neural network filter that classifies all images of 20 x 20 pixels into two groups - faces and non-faces. The pre-processing module and the neural network are applied to each section of the image under review and of the images obtained from the original by repeated reduction. This mechanism makes it possible to find faces of different sizes (a sketch of the scanning scheme is given after the list).

2) An arbitrator to reject objects that have been erroneously identified as faces.
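The multi-scale scanning mechanism of the first subsystem can be sketched as follows (classify_window stands for the pre-processing plus neural-network filter, which is not shown; the step and scale values are illustrative):

import numpy as np
from scipy import ndimage

def scan_for_faces(gray, classify_window, scale_step=1.2, min_size=20):
    """Scan an image for 20x20 face candidates at multiple scales: the
    classifier is applied to every 20x20 window of the original and of
    progressively reduced copies, so larger faces in the original map
    onto the fixed window size."""
    detections = []
    scale = 1.0
    img = gray.astype(np.float64)
    while min(img.shape) >= min_size:
        for y in range(0, img.shape[0] - 20 + 1, 2):
            for x in range(0, img.shape[1] - 20 + 1, 2):
                if classify_window(img[y:y + 20, x:x + 20]):
                    # Record the window in original-image coordinates.
                    detections.append((int(x * scale), int(y * scale),
                                       int(20 * scale)))
        img = ndimage.zoom(img, 1 / scale_step)   # reduce the image
        scale *= scale_step
    return detections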

To date we have implemented the first subsystem and the training mechanism.

Training the neural network by examples makes it possible to set its parameters automatically. Particular care in preparing the training set of face images makes the detection system invariant to slight inclinations of the head. The system can be retrained on examples without faces if, when tested, it detects faces in them erroneously.

For training, 500 images with faces were prepared; they gave rise to 3000 examples of faces (by slight turns and mirror reflection of the original images). Another 500 images without faces were generated at random. After the latest training the system gives good results in face detection, yet erroneous detections are still numerous. The errors result from the difficulty of preparing a comprehensive set of negative examples, and also from the fact that the arbitration subsystem has not been implemented yet. We keep working on it.

6. Conclusions

In developing techniques to search for images and video data on the basis of visual attributes, we believe that the principal objective is a system of comprehensive content-based retrieval in digital collections of images. The best approach seems to be the use of various sources of information: text descriptions prepared by an expert, automatically computed simple visual characteristics, and detected domain objects.

As to video materials, search information can be supplemented by speech recognition and by extracting captions to be processed by a text recognition system. Efficient use of a digital video library calls for search structures that carry information about the principal sounds accompanying the video images. To solve this problem it is essential to automatically identify not only voice but also other acoustic phenomena such as noise, background music, etc. Apart from search aspects, automatic analysis of the audio-visual content of video materials is essential for extracting the most informative subset of frames to present a summary of search results; this is important for browsing a great scope of material. Thus, a frame is more valuable if it contains a close shot of a human face and captions.

The research is aimed at providing efficient access to digital libraries of images and video films built on the basis of collections accumulated in archives, museums and libraries, as well as private collections, which are increasing in number due to the growing popularity of digital photo and video cameras.

References

1. Ardizzone, E., La Cascia, M., and Molinelli, D., Motion and Color Based Video Indexing and Retrieval, Proc. Int. Conf. on Pattern Recognition (ICPR-96), Vienna, Austria, Aug. 1996. http://www.cs.bu.edu/associates/marco/publications.html

2. Ardizzone, E., La Cascia, M., Di Gesu, V., and Valenti, C., Content Based Indexing of Image and Video Databases by Global and Shape Features, 1996. http://www.cs.bu.edu/associates/marco/publications.html

3. Baigarova, N.S. and Bukhshtab, Yu.A., Some Principles of Organization for Searching through Video Data, Programming and Computer Software, Vol. 25, No. 3, 1999, pp. 165-170.

4. Baigarova, N.S. and Bukhshtab, Yu.A., Project "Cinema Chronicle of Russia", First Russian Workshop on Digital Libraries, Saint Petersburg, 1999, pp. 209-215.

5. Baigarova, N.S., Bukhshtab, Yu.A., and Evteeva, N.N., Organization of Video Digital Library, preprint No. 5, Institute of Applied Mathematics, RAS, 2000.

6. Baigarova, N.S., Bukhshtab, Yu.A., Vorobjov, A.A., and Gornij, A.A., Visual Information Management, preprint No. 6, Institute of Applied Mathematics, RAS, 2000.

7. Baigarova, N.S., Bukhshtab, Yu.A., and Gornij, A.A., Indexing and Retrieval of Visual Data, preprint No. 7, Institute of Applied Mathematics, RAS, 2000.

8. Barron, J.L., Fleet, D.J., and Beauchemin, S.S., Performance of Optical Flow Techniques, Int. Journal of Computer Vision, Vol. 12, No. 1, pp. 43-77, 1994.

9. Carson, C. and Ogle, V.E., Storage and Retrieval of Feature Data for a Very Large Online Image Collection, 1996. http://elib.cs.berkeley.edu/papers/

10. Christel, M., Stevens, S., Kanade, T., Mauldin, M., Reddy, R., and Wactlar, H., Techniques for the Creation and Exploration of Digital Video Libraries, Multimedia Tools and Applications, Vol. 2, Boston: Kluwer, 1996.

11. Horn, B.K.P. and Schunck, B.G., Determining Optical Flow, Artificial Intelligence, Vol. 17, 1981.

12. Jain, R. and Gupta, A., Computer Vision and Visual Information Retrieval, 1996. http://vision.ucsd.edu/papers/rosenfeld/

13. Jain, R. and Gupta, A., Visual Information Retrieval, Communications of the ACM, Vol. 40, No. 5, 1997.

14. Jain, R., Pentland, A.P., and Petkovic, D., Workshop Report: NSF-ARPA Workshop on Visual Information Management Systems, 1995. http://www.virage.com/vim/vimsreport95.html

15. Looney, C.G., Pattern Recognition Using Neural Networks: Theory and Algorithms for Engineers and Scientists, Oxford University Press, 1997.

16. Rowley, H.A., Baluja, S., and Kanade, T., Neural Network-Based Face Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998; also in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 203-208, San Francisco, CA.

Yuri Bukhshtab - The Keldysh Institute of Applied Mathematics, Head of Laboratory. Bukhshtab graduated from Moscow State University and holds a Ph.D. in computer science. He is the author of more than 50 published scientific works. Since 1967 he has been involved in scientific research and software development in such areas as programming languages, artificial intelligence, application packages, information retrieval systems, database management systems and hypertext technologies.

Natalia Evteeva - The Keldysh Institute of Applied Mathematics, Senior Research Scientist. Evteeva graduated from Moscow State University and holds a Ph.D. in computer science. She is the author of more than 20 published scientific works. Since 1981 she has been involved in the development and use of software for information retrieval systems, supercomputers and hypertext technologies.

Natalia Baigarova - The Keldysh Institute of Applied Mathematics, Senior Research Scientist; Moscow State University, Instructor. Baigarova graduated from Moscow State University and holds a Ph.D. in computer science. She has 20 published scientific works. Fields of interest: visual information retrieval techniques, database management systems, information systems, multimedia systems, expert systems, neural networks.

All authors are research workers at the Laboratory of Information Technologies. The project described above was supported by grant No. 98-07-91158 of the Russian Foundation for Basic Research within the framework of the Digital Libraries Program for the term from September 1998 till December 2000. In addition, Yuri Bukhshtab, the head of this laboratory, is the project director for the creation of an electronic catalogue of the film collection of the Russian State Archives of Film and Photo Documents. This project was supported in 2000 by a grant from the Open Society Institute ($115,000).

Our laboratory is a department of the Keldysh Institute of Applied Mathematics, one of the largest research centres in Russia, with a staff of about 600 researchers. The Keldysh Institute was founded in 1953 to solve complex mathematical problems involved in national projects of space exploration, atomic and thermonuclear energy applications, etc. The Institute has been a pioneer in utilising computer facilities in Russia. Since its first years, the Institute's activity, oriented towards solving large-scale applied problems, has been based on the results of fundamental scientific research in mathematics, mechanics, cybernetics, informatics, etc. carried out by the Institute's scientists.