Hsueh-Ming Hang
Information Fusion in Some Image Machine Learning Applications
Hsueh-Ming Hang received the B.S. and M.S. degrees from National Chiao Tung University, Hsinchu, Taiwan, in 1978 and 1980, respectively, and the Ph.D. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1984. From 1984 to 1991, he was with AT&T Bell Laboratories, Holmdel, NJ, and he joined the Electronics Engineering Department of National Chiao Tung University (NCTU), Hsinchu, Taiwan, in December 1991. From 2006 to 2009, he took a leave from NCTU and was appointed Dean of the EECS College at National Taipei University of Technology (NTUT). He is currently the Dean of the ECE College, NCTU. He has been actively involved in the international MPEG standards since 1984, and his current research interests include multimedia compression, multiview image/video processing, and deep-learning-based image/video processing.
Dr. Hang holds 13 patents (Taiwan, US, and Japan) and has published over 190 technical papers related to image compression, signal processing, and video codec architecture. He was an associate editor (AE) of the IEEE Transactions on Image Processing (1992-1994, 2008-2012) and the IEEE Transactions on Circuits and Systems for Video Technology (1997-1999). He is a co-editor and contributor of the Handbook of Visual Communications, published by Academic Press in 1995. He was an IEEE Circuits and Systems Society Distinguished Lecturer (2014-2015) and is a Board Member of the Asia-Pacific Signal and Information Processing Association (APSIPA) for the 2013-2018 term. He is a recipient of the IEEE Third Millennium Medal, a Fellow of IEEE and IET, and a member of Sigma Xi.
Information Fusion in Some Image Machine Learning Applications
The term “information” is used in a broad sense here: we would like to combine information coming from different sources, which may have different characteristics or meanings. Another core technique in our systems is machine learning, in particular the Convolutional Neural Network (CNN). A typical CNN system is end-to-end, but in order to combine information from different sources we propose multiple-stage system structures. This talk presents two examples: human detection based on RGB-D data, and multiple-query image retrieval. They represent two distinct directions. The RGB image and the depth image have very different characteristics, although they are captured from the same object: do they provide complementary information? In image retrieval, the second query may be, say, a different view of the same building: does it help to increase retrieval accuracy?
Content-based image retrieval over large databases has become an essential tool in image processing. We focus on retrieving a specific building using a view that differs from the photographing angles in the database. In addition, if the user can provide additional images as second or third queries, how do we combine the information provided by these multiple queries? We propose multi-query fusion methods to achieve better accuracy. In this study, we test two different types of features designed for retrieval purposes: the Scale-Invariant Feature Transform (SIFT) feature as the low-level feature, and the Convolutional Neural Network (CNN) feature as the high-level feature. AlexNet is adopted as our CNN model and is modified into a Siamese-triplet network to suit the image retrieval task. We propose fusion at several levels. Our best method outperforms most state-of-the-art retrieval methods for a single query, and multi-query retrieval further increases the retrieval accuracy.
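The abstract does not spell out the fusion rule, but score-level (late) fusion of the per-query similarities is one natural reading. Below is a minimal, hypothetical Python sketch: each image is represented by an L2-normalized feature vector (a SIFT aggregate or CNN embedding, computed elsewhere), and database images are ranked by their average cosine similarity over all queries. The function name and the averaging rule are illustrative assumptions, not the speaker's method.

```python
# Hypothetical sketch of score-level multi-query fusion for retrieval.
import numpy as np

def retrieve(queries, database, top_k=10):
    """Rank database images by average cosine similarity to all queries.

    queries:  (m, d) array of m L2-normalized query feature vectors
    database: (n, d) array of n L2-normalized database feature vectors
    """
    sims = queries @ database.T        # (m, n) cosine similarities
    fused = sims.mean(axis=0)          # late fusion: average over queries
    return np.argsort(-fused)[:top_k]  # indices of the best matches

# Toy usage with random stand-in features (d = 256): two noisy views
# of database image 3 serve as queries, so index 3 should rank first.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 256))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[[3, 3]] + 0.1 * rng.normal(size=(2, 256))
q /= np.linalg.norm(q, axis=1, keepdims=True)
print(retrieve(q, db))
```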
Accurate human detection is still a challenging topic due to the complicated environments of the real world. On the other hand, RGB-D cameras such as the Microsoft Kinect sensor, which provide both RGB and depth data, have become popular at reasonable prices, and the depth information is often helpful for detection. We adopt the R-CNN method, which combines the Selective Search technique to generate region proposals with a CNN (Convolutional Neural Network) to produce features. A depth-map encoding technique, HHA (horizontal disparity, height above ground, and angle with gravity), is adopted to match the CNN input format for feature learning. The HHA and RGB images are our system inputs, and we propose several algorithms to combine their information in constructing various human detectors. Our information fusion structures include CNNs and SVMs, together with PCA for feature reduction. We show that human detection becomes more accurate with the aid of depth information.
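As one plausible reading of the fusion structure named above (feature concatenation, PCA reduction, SVM classification), here is a hedged sketch using random stand-in features; the actual network outputs, dimensions, and training protocol of the talk are not specified, so everything below is an assumption.

```python
# Hedged sketch: concatenate per-proposal CNN features from the RGB and
# HHA streams, reduce with PCA, and classify with a linear SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 500                                  # number of region proposals
f_rgb = rng.normal(size=(n, 4096))       # fc7-like features, RGB stream
f_hha = rng.normal(size=(n, 4096))       # fc7-like features, HHA stream
labels = rng.integers(0, 2, size=n)      # 1 = human, 0 = background

fused = np.hstack([f_rgb, f_hha])        # feature-level fusion (8192-D)
pca = PCA(n_components=256).fit(fused)   # PCA for feature reduction
clf = LinearSVC().fit(pca.transform(fused), labels)
scores = clf.decision_function(pca.transform(fused))  # per-proposal scores
```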
Ralf Schäfer
Depth Based Image Processing for Videogrammetry and 3D Endoscopy/Microscopy
Ralf Schäfer is Director of the Video Division at the Fraunhofer Heinrich Hertz Institute (HHI) in Berlin, where he is responsible for 90 researchers and 50 undergraduate students. He studied electrical engineering at the Technical University Berlin (TUB) and joined HHI as a researcher in 1977. In 1984 he received his doctorate at TUB in the area of digital video coding of TV signals. His research interests cover all areas related to images and video, from acquisition to display and from algorithm development to ASIC implementation. Besides his role as Division Director, he is responsible for three technology centers: the CINIQ Center for Data and Information Intelligence (http://www.ciniq.de), the Innovation Center for Immersive Imaging Technologies - 3IT (http://www.3it-berlin.de), and the Tomorrow's Immersive Media Experience (TiME) Lab (http://www.timelab-hhi.com), where smart data solutions and immersive technologies are developed and demonstrated.
Ralf Schäfer is a member of the German "Society for Information Technology" (ITG), where he chaired the expert group "Digital Coding" for more than 20 years. Furthermore, he is a member of the German "Society for Television and Motion Picture Technology" (FKTG), where he serves on the URTEL Award Committee.
In 1986 he received the paper award of the ITG and in 2000 the Richard Theile Medal of the FKTG.
He was co-founder of the two spin-off companies 2SK Media Technologies and MikroM GmbH.
Depth Based Image Processing for Videogrammetry and 3D Endoscopy/Microscopy
Depth based image processing has become a key technology for many applications in media (e.g. postproduction, 3D film, 3DTV, VR, videogrammetry), communication (e.g. 3D video coding), automotive (e.g. driver assistance) and medicine (e.g. 3D endoscopy, 3D microscopy).
The basic technology for all these applications is disparity estimation, for which we usually use a multi-step approach in our work. It consists of robust and temporally consistent Hybrid Recursive Matching (HRM) followed by highly accurate patch sweeping. The estimated depth maps are then post-processed by cross-bilateral filtering and temporal stabilization in order to reduce spatial and temporal inconsistencies and improve the visual experience.
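For readers unfamiliar with cross-bilateral (joint bilateral) filtering, the sketch below shows the core idea on a depth map: the range kernel is computed on a guide image rather than on the depth itself, so filtered depth edges snap to image edges. This is a textbook version, not the HHI implementation; the radius and sigma values are illustrative.

```python
# Minimal cross-bilateral depth filter: spatial kernel as usual, range
# kernel from the guide (grayscale) image, assumed scaled to [0, 1].
import numpy as np

def cross_bilateral(depth, guide, radius=4, sigma_s=3.0, sigma_r=0.1):
    h, w = depth.shape
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))  # spatial kernel
    d = np.pad(depth, radius, mode='edge')
    g = np.pad(guide, radius, mode='edge')
    for y in range(h):
        for x in range(w):
            dw = d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            gw = g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            # Range weights from guide-image differences, not depth differences,
            # so depth discontinuities align with image edges.
            range_k = np.exp(-(gw - guide[y, x])**2 / (2 * sigma_r**2))
            wgt = spatial * range_k
            out[y, x] = (wgt * dw).sum() / wgt.sum()
    return out
```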
Once good disparity maps are available, they can be used for different forms of video processing and different applications. One of these techniques is Depth Image Based Rendering (DIBR), which can be used to compute intermediate views in 3D video. Another application is 3D modelling of moving persons or objects, also denoted as videogrammetry, a term derived from photogrammetry for static objects.
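A minimal sketch of the DIBR principle follows, under simplifying assumptions (rectified views, purely horizontal disparity in pixels): left-view pixels are forward-warped by a fraction of their disparity to synthesize an intermediate view, a z-buffer resolves collisions, and holes are left for a later disocclusion-filling step. This illustrates the idea only, not a production renderer.

```python
# Forward-warping DIBR sketch for a grayscale left view.
import numpy as np

def render_intermediate(left, disparity, alpha=0.5):
    """alpha = 0 reproduces the left view, alpha = 1 the right view."""
    h, w = disparity.shape
    view = np.zeros_like(left)
    zbuf = np.full((h, w), -np.inf)   # larger disparity = nearer object
    hole = np.ones((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xt = int(round(x - alpha * disparity[y, x]))  # shifted column
            # Keep the nearest contributor when several pixels collide.
            if 0 <= xt < w and disparity[y, x] > zbuf[y, xt]:
                view[y, xt] = left[y, x]
                zbuf[y, xt] = disparity[y, x]
                hole[y, xt] = False
    return view, hole   # `hole` marks disocclusions needing inpainting
```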
This presentation will mainly focus on two applications: depth-based processing for digital stereoscopy, and videogrammetry, which can be used in media production in general and for Virtual Reality (VR) in particular, but also for medical applications such as the rehabilitation of stroke patients.
The significance of digital stereoscopy for medical technology, such as surgery microscopy and endoscopy, has been increasing for years. Stereoscopic systems offer new possibilities to extract new kinds of information, to add value to operational conditions, and to improve work processes in the operating room. However, it is also well known that stereo of limited quality may cause eyestrain, headaches, or visual fatigue. Obviously, such circumstances have to be avoided in surgery environments. Therefore, a number of rules, such as geometric, photometric, and colorimetric alignment of stereo pairs, avoidance of window violations, a moderate depth budget (comfort zone), etc., have to be respected to guarantee good quality in medical applications. However, medical stereoscopy poses a number of challenges: extreme close-ups may lead to large disparities and high depth variations. In addition, such systems use stereo sensors with a small baseline (inter-axial distance), differing considerably from the interpupillary distance of the viewer; as a consequence, there is only a small range of working distances that ensures natural shape persistence. Furthermore, surgery instruments can introduce strong and very annoying window violations. In this context, existing and reliable methods of quality control and enhancement that are well known from media applications have to be modified and tailored to the special requirements and working conditions of medical applications. It is possible to overcome these problems by applying multiple steps of 3D video processing, including special methods to correct errors occurring during stereo capturing and subsequent processing to guarantee comfortable stereo viewing conditions on a 3D display or through a binocular. By applying such methods, one can improve 3D perception quality and reduce visual discomfort significantly. These improvements also pave the road for further fields of application, such as virtual assistance tools for surgery. Furthermore, depth image-based rendering can be used to generate new virtual stereo pairs with improved depth representation (e.g. linear reduction or enlargement of the depth range, or its non-linear compression or expansion to highlight depth regions of particular interest).
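The depth remapping mentioned in the last sentence can be illustrated with two toy curves, one linear and one non-linear; the specific functions and limits below are example choices, not the methods used in the actual systems.

```python
# Illustrative disparity remapping curves for comfort-zone control.
import numpy as np

def remap_linear(disp, target_min=-20.0, target_max=20.0):
    """Linearly compress/stretch the full disparity range into a budget."""
    lo, hi = disp.min(), disp.max()
    return (disp - lo) / (hi - lo) * (target_max - target_min) + target_min

def remap_nonlinear(disp, center=0.0, gain=2.0, limit=20.0):
    """Steeper slope (expanded depth) near `center`, soft-clipped at +/-limit,
    so a depth region of particular interest is highlighted while the
    extremes are compressed."""
    return limit * np.tanh(gain * (disp - center) / limit)
```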
Videogrammetry means 3D scene reconstruction of persons or moving objects for real-time production, post-production, and on-set visual-effect previews. The applied approach is based on multiple trifocal camera capture systems, which can be arbitrarily distributed on set; here the problem of multi-view data fusion has to be solved. Instead of performing pixel-wise processing, we use patch groups as higher-level scene representations. Based on the robust results of the trifocal sub-systems, we implicitly obtain an optimized set of patch groups, even for partly occluded regions, by applying a simple geometric rule set. Furthermore, we show that a simplified meshing can be applied to the patch-group borders, which enables a GPU-centric real-time implementation. The method is tested on real-world test-shoot data for the 3D reconstruction of humans.
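The patch-group rule set itself is not described in the abstract. As a stand-in that conveys only the flavor of a simple geometric fusion rule, the following toy sketch merges 3D patches (point, normal, confidence) pooled from several sub-systems and collapses near-duplicates on a voxel grid, keeping the most confident patch; none of this reflects the actual HHI algorithm.

```python
# Toy multi-view patch fusion: deduplicate pooled patches on a voxel grid.
import numpy as np

def fuse_patches(points, normals, conf, voxel=0.01):
    """points (n, 3), normals (n, 3), conf (n,) pooled over all sub-systems."""
    keys = np.floor(points / voxel).astype(np.int64)  # voxel index per patch
    order = np.argsort(-conf)                         # most confident first
    seen, keep = set(), []
    for i in order:
        k = tuple(keys[i])
        if k not in seen:        # first (best) patch claims the voxel
            seen.add(k)
            keep.append(i)
    keep = np.array(keep)
    return points[keep], normals[keep], conf[keep]
```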
Maciej Niedźwiecki
Elimination of impulsive disturbances from archive audio recordings
Maciej Niedźwiecki received the M.Sc. and Ph.D. degrees from the Gdańsk University of Technology, Gdańsk, Poland, and the Dr.Hab. (D.Sc.) degree from the Technical University of Warsaw, Warsaw, Poland, in 1977, 1981 and 1991, respectively.
He spent three years, from 1986 to 1989, as a Research Fellow with the Department of Systems Engineering, Australian National University. From 1990 to 1993 he served as Vice-Chairman of the Technical Committee on Theory of the International Federation of Automatic Control (IFAC), and from 2009 to 2014 as an Associate Editor for the IEEE Transactions on Signal Processing. He is currently a Professor and Head of the Department of Automatic Control, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology. His main research interests include system identification, statistical signal processing, and adaptive systems. He is the author of the book Identification of Time-varying Processes (Wiley, 2000) and the principal author of the commercial software package for restoration and re-mastering of archive audio recordings called DART (Digital Audio Restoration Technology), which over the past 20 years has been used by more than 2000 audio prosumers worldwide.
Dr. Niedźwiecki is currently a member of the IFAC committees on Modeling, Identification and Signal Processing and on Large Scale Complex Systems, and a member of the Automatic Control and Robotics Committee of the Polish Academy of Sciences (PAN).
Elimination of impulsive disturbances from archive audio recordings
Archived audio recordings are often degraded by impulsive disturbances and wideband noise. Clicks, pops, ticks, crackles, and record scratches are caused by aging and/or mishandling of the surface of gramophone records (shellac or vinyl), by specks of dust and dirt, by faults in the record stamping process (e.g. gas bubbles), and by slight imperfections of the playing surface due to the use of coarse-grain fillers in the record composition. In the case of magnetic tape recordings, impulsive disturbances can usually be attributed to transmission or equipment artifacts (e.g. electric or magnetic pulses).
Wideband background noise, such as the so-called surface noise of magnetic tapes and phonograph records, is an inherent component of all analog recordings.
The elimination of both types of disturbances from archive audio documents is an important element of saving our cultural heritage. The Polish Radio Archives and the Polish National Library Archives alone contain more than one million archive audio documents of different content (historic speeches, interviews, concerts, studio music recordings, etc.), saved on different media such as piano rolls, phonograph and gramophone records, magnetic tapes, etc. The British Library Sound Archive, which is among the largest collections of recorded sound in the world, holds over three million recordings, including over a million discs and 200,000 tapes. Digitization of these documents is an ongoing process (in Poland carried out, among others, by the Polish National Digital Archives), which will soon be followed by the next obvious step: audio restoration. This makes research on audio restoration technology both practically useful and timely.
The majority of known approaches to the elimination of impulsive disturbances from archive audio signals are based on adaptive prediction: an autoregressive (AR) or autoregressive moving average (ARMA) model of the analyzed signal is continuously updated and used to predict consecutive signal samples. Whenever the absolute value of the one-step-ahead prediction error becomes too large, namely when it exceeds a prescribed multiple of its estimated standard deviation, a "detection alarm" is raised and the sample is scheduled for reconstruction. The test is then extended to multiple-step-ahead prediction errors: the detection alarm is terminated when a given number of consecutive samples remain sufficiently close to the predicted signal trajectory (or when the length of the detection alarm reaches its maximum allowable value). Finally, once the pulse is localized, the corrupted samples are interpolated, using the same signal model that served for detection, based on the uncorrupted neighboring samples.
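The core of the detection scheme just described can be condensed into a few lines of illustrative Python: fit an AR model, compute one-step-ahead prediction errors, and flag samples whose error exceeds a multiple of a robustly estimated standard deviation. Adaptive model updating, the alarm-termination logic, and model-based interpolation are omitted; the model order and threshold below are example values, not recommendations.

```python
# Bare-bones AR-based impulsive-disturbance detector (batch version).
import numpy as np

def fit_ar(x, p):
    """Least-squares AR(p) fit: x[t] is modeled as a @ (x[t-1], ..., x[t-p])."""
    X = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

def detect_clicks(x, p=10, k=4.0):
    a = fit_ar(x, p)
    e = np.zeros_like(x)
    for t in range(p, len(x)):
        e[t] = x[t] - a @ x[t - p:t][::-1]   # one-step prediction error
    sigma = np.median(np.abs(e[p:])) / 0.6745  # robust sigma estimate
    return np.abs(e) > k * sigma               # True = detection alarm

# Toy usage: two artificial clicks on a noisy sinusoid. Samples right
# after a click may also be flagged, since their predictors are corrupted;
# this loosely mirrors extending the alarm to subsequent samples.
t = np.arange(8000)
x = np.sin(0.05 * t) + 0.01 * np.random.default_rng(1).normal(size=t.size)
x[[1200, 4711]] += 2.0
print(np.flatnonzero(detect_clicks(x)))  # should include 1200 and 4711
```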
The presentation will provide an overview of the currently used techniques for impulsive noise reduction, with a focus on the most recent advances such as sparse modeling, bidirectional processing, deterministic and stochastic pattern matching, semi-causal detection, and Gibbs sampling.