
[Paper Translation] Temporal Convolutional Networks for Action Segmentation and Detection

얼죽아여뜨샤 2023. 11. 14. 00:08

0. Original Paper

 

Temporal_Convolutional_Networks_for_Action_Segmentation_and_Detection.pdf (0.67 MB)

1. Translation

(0) Abstract

The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.

The ability to identify and temporally segment fine-grained human actions throughout a video is important in many fields, including robotics, surveillance, and education. Typical approaches decouple the problem: they first extract local spatiotemporal features from video frames, then feed those features into a temporal classifier that captures high-level temporal patterns. We introduce a class of temporal models called Temporal Convolutional Networks (TCNs), which use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection in video. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, while our Dilated TCN uses dilated convolutions. TCNs are good at capturing action compositions, segment durations, and long-range dependencies, and they train over an order of magnitude faster than competing LSTM-based recurrent neural networks. We apply these models to three challenging fine-grained datasets and show large improvements over the prior state of the art.

 

(1) Introduction

Action segmentation is crucial for applications ranging from collaborative robotics to analysis of activities of daily living. Given a video, the goal is to simultaneously segment every action in time and classify each constituent segment. In the literature, this task goes by either action segmentation or detection. We focus on modeling situated activities – such as in a kitchen or surveillance setup – which are composed of dozens of actions over a period of many minutes. These actions, such as cutting a tomato versus peeling a cucumber, are often only subtly different from one another.

Action segmentation is important for applications ranging from collaborative robotics to the analysis of activities of daily living. Given a video, the goal is to simultaneously segment every action in time and classify each resulting segment. In the literature this task is called either action segmentation or action detection. We focus on modeling situated activities, such as those in a kitchen or a surveillance setting, which are composed of dozens of actions over a span of many minutes. These actions, such as cutting a tomato versus peeling a cucumber, are often only subtly different from one another.

 

Current approaches decouple this task into extracting low-level spatiotemporal features and applying high-level temporal classifiers. While there has been extensive work on the former, recent temporal models have been limited to sliding window action detectors [26, 28, 21], which are typically too short to capture long-range temporal patterns; segmental models [25, 16, 24], which typically condition the current action class on the previous segment, thus ignoring long-range latent dependencies; and recurrent models [28, 10], which empirically can have a limited span of attention [28] and are hard to correctly train [22]. For many of these models, such as RNNs with LSTM or GRUs [5], the latent state at each time step, t, is only a function of the data at t and the hidden state and memory at t − 1. This is limiting when an action is defined by the changes in features over the course of many frames.

Current approaches decouple this task into extracting low-level spatiotemporal features and applying high-level temporal classifiers. While there has been extensive work on the former, recent temporal models have been limited to: sliding-window action detectors [26, 28, 21], which are typically too short to capture long-range temporal patterns; segmental models [25, 16, 24], which usually condition the current action class only on the previous segment and therefore ignore long-range latent dependencies; and recurrent models [28, 10], which empirically can have a limited span of attention [28] and are hard to train correctly [22]. For many of these models, such as RNNs with LSTM or GRU cells [5], the latent state at each time step t is only a function of the data at t and the hidden state and memory at t−1. This is limiting when an action is defined by how features change over the course of many frames.

 

In this paper, we discuss a class of time-series models, which we call Temporal Convolutional Networks (TCNs), that overcome the previous shortcomings by capturing long-range patterns using a hierarchy of temporal convolutional filters. We present two types of TCNs: First, our Encoder-Decoder TCN (ED-TCN) only uses a hierarchy of temporal convolutions, pooling, and upsampling but can efficiently capture long-range temporal patterns. The ED-TCN has a relatively small number of layers (e.g., 3 in the encoder) but each layer contains a set of long convolutional filters. Second, a Dilated TCN uses dilated convolutions instead of pooling and upsampling and adds skip connections between layers. This is an adaptation of the recent WaveNet [34] model, which shares similarities to our ED-TCN but was developed for speech synthesis. The Dilated TCN has more layers, but each uses dilated filters that only operate on a small number of time steps. Empirically, both TCNs are capable of capturing features of segmental models, such as action durations and pairwise transitions between segments, as well as long-range temporal patterns similar to recurrent models. These models tend to outperform our Bidirectional LSTM (Bi-LSTM) [9] baseline and are over a magnitude faster to train. The ED-TCN in particular produces many fewer over-segmentation errors than other models.

In this paper we discuss a class of time-series models, which we call Temporal Convolutional Networks (TCNs), that overcome these shortcomings by capturing long-range patterns with a hierarchy of temporal convolutional filters. We present two types of TCNs. First, the Encoder-Decoder TCN (ED-TCN) uses only a hierarchy of temporal convolutions, pooling, and upsampling, yet can efficiently capture long-range temporal patterns. The ED-TCN has relatively few layers (e.g., three in the encoder), but each layer contains a set of long convolutional filters. Second, the Dilated TCN uses dilated convolutions instead of pooling and upsampling and adds skip connections between layers. It is an adaptation of the recent WaveNet [34] model, which shares similarities with the ED-TCN but was developed for speech synthesis. The Dilated TCN has more layers, but each uses dilated filters that operate over only a small number of time steps. Empirically, both TCNs can capture the properties of segmental models, such as action durations and pairwise transitions between segments, as well as long-range temporal patterns similar to recurrent models. They tend to outperform the Bidirectional LSTM (Bi-LSTM) [9] baseline and are over an order of magnitude faster to train. The ED-TCN in particular produces far fewer over-segmentation errors than the other models.
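To make the dilated-convolution idea concrete, here is a minimal sketch (not from the paper; the layer count, channel width, and the WaveNet-style doubling of the dilation rate are illustrative assumptions) showing how stacking dilated 1-D convolutions with summed skip connections grows the receptive field exponentially with depth:

```python
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    """Illustrative stack of dilated 1-D temporal convolutions.

    Dilation doubles at every layer (1, 2, 4, ...), so with kernel size k
    the receptive field grows as 1 + (k - 1) * (2**L - 1) time steps.
    """
    def __init__(self, channels=64, layers=4, kernel_size=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=2 ** l,
                      padding=(kernel_size - 1) // 2 * 2 ** l)  # keeps length T
            for l in range(layers)
        ])

    def forward(self, x):                # x: (batch, channels, time)
        skips = []
        for conv in self.blocks:
            x = torch.relu(conv(x))
            skips.append(x)              # skip connection from every layer
        return sum(skips)                # summed skips feed the frame-wise predictions

features = torch.randn(1, 64, 250)       # one video: 64-dim features over 250 frames
print(DilatedStack()(features).shape)    # torch.Size([1, 64, 250])
```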

 

In the literature, our task goes by the name action segmentation [7, 8, 6, 15, 29, 16, 10] or action detection [28, 20, 21, 25]. Despite effectively being the same problem, the temporal methods in segmentation papers tend to differ from detection papers, as do the metrics by which they are evaluated. In this paper, we evaluate on datasets targeted at both tasks and propose a segmental F1 score, which we qualitatively find is a meaningful metric for applications of both tasks. We use MERL Shopping [28] which was designed for action detection, Georgia Tech Egocentric Activities [8] which was designed for action segmentation, and 50 Salads [30] which has been used for both. Code, features, and temporal predictions are available.

https://github.com/colincsl/

In the literature this task is called either action segmentation [7, 8, 6, 15, 29, 16, 10] or action detection [28, 20, 21, 25]. Although the two are effectively the same problem, the temporal methods in segmentation papers tend to differ from those in detection papers, as do the metrics used for evaluation. In this paper we evaluate on datasets targeting both tasks and propose a segmental F1 score, which we find to be a meaningful metric for applications of both tasks. We use MERL Shopping [28], which was designed for action detection; Georgia Tech Egocentric Activities [8], which was designed for action segmentation; and 50 Salads [30], which has been used for both. Code, features, and temporal predictions are available.

 

(2) Related Work

Action segmentation methods predict what action is occurring at every frame in a video and detection methods output a sparse set of action segments, where a segment is defined by a start time, end time, and class label. It is possible to convert between a given segmentation and set of detections by simply adding or removing null/background segments.

Action segmentation methods predict which action is occurring at every frame of a video, while detection methods output a sparse set of action segments, where each segment is defined by a start time, an end time, and a class label. One can convert between a segmentation and a set of detections simply by adding or removing null/background segments.
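The conversion described above is mechanical. Below is a minimal sketch (the label values and the background id are assumptions for illustration, not from the paper) that turns per-frame class labels into detection-style (start, end, class) segments by run-length grouping and dropping the background class; re-inserting background segments converts back the other way:

```python
from itertools import groupby

def frames_to_segments(frame_labels, background=0):
    """Group per-frame labels into sparse (start, end, class) segments."""
    segments, t = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != background:          # dropping background yields detections
            segments.append((t, t + length - 1, label))
        t += length
    return segments

# e.g. 0 = background, 1 = "cut tomato", 2 = "peel cucumber"
print(frames_to_segments([0, 0, 1, 1, 1, 0, 2, 2]))
# [(2, 4, 1), (6, 7, 2)]
```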

 

Action Detection: Many fine-grained detection papers use sliding window-based detection methods on spatial or spatiotemporal features. Rohrbach et al. [26] used Dense Trajectories [37] and human pose features on the MPII Cooking dataset. At each frame they evaluated a sliding SVM for many candidate segment lengths and performed non-maximal suppression to find a small set of action predictions. Ni et al. [21] used an object-centric feature representation, which iteratively parses object locations and spatial configurations, and applied it to the MPII Cooking and ICPR 2012 Kitchen datasets. Their approach used Dense Trajectory features as input into a sliding-window detection method with segment intervals of 30, 60, and 90 frames.

Action detection: Many fine-grained detection papers use sliding-window detection methods on spatial or spatiotemporal features.

Rohrbach et al. [26] used Dense Trajectories [37] and human pose features on the MPII Cooking dataset. At each frame they evaluated a sliding SVM for many candidate segment lengths and performed non-maximal suppression to obtain a small set of action predictions. Ni et al. [21] used an object-centric feature representation, which iteratively parses object locations and spatial configurations, and applied it to the MPII Cooking and ICPR 2012 Kitchen datasets. Their approach fed Dense Trajectory features into a sliding-window detection method with segment intervals of 30, 60, and 90 frames.

Singh et al. [28] improved upon this by feeding per-frame CNN features into an LSTM model and applying a method analogous to non-maximal suppression to the LSTM output. We use Singh’s proposed dataset, MERL Shopping, and show our approach outperforms their LSTM-based detection model. Recently, Richard et al. [25] introduced a segmental approach that incorporates a language model, which captures pairwise transitions between segments, and a duration model, which ensures that segments are of an appropriate length. In the experiments section we show that our model is capable of capturing both of these components.

Singh et al. [28] improved on this by feeding per-frame CNN features into an LSTM model and applying a method analogous to non-maximal suppression to the LSTM output. We use Singh's proposed dataset, MERL Shopping, and show that our approach outperforms their LSTM-based detection model. Recently, Richard et al. [25] introduced a segmental approach that incorporates a language model, which captures pairwise transitions between segments, and a duration model, which ensures that segments have an appropriate length. In the experiments section we show that our model can capture both of these components.

 

...

 

(3) Temporal Convolutional Network

In this section we define two TCNs, each of which has the following properties: (1) computations are performed layer-wise, meaning every time step is updated simultaneously, instead of updating sequentially per frame; (2) convolutions are computed across time; and (3) predictions at each frame are a function of a fixed-length period of time, which is referred to as the receptive field. Our ED-TCN uses an encoder-decoder architecture with temporal convolutions and the Dilated TCN, which is adapted from the WaveNet model, uses a deep series of dilated convolutions. The input to a TCN will be a set of video features, such as those output from a spatial or spatiotemporal CNN, for each frame of a given video. Let X_t ∈ R^{F_0} be the input feature vector of length F_0 for time step t, for 1 ≤ t ≤ T. Note that the number of time steps T may vary for each video sequence. The action label for each frame is given by the vector Y_t ∈ {0, 1}^C, where C is the number of classes, such that the true class is 1 and all others are 0.

In this section we define two TCNs, each of which has the following properties: (1) computations are performed layer-wise, meaning every time step is updated simultaneously rather than sequentially frame by frame; (2) convolutions are computed across time; and (3) the prediction at each frame is a function of a fixed-length window of time, referred to as the receptive field. The ED-TCN uses an encoder-decoder architecture with temporal convolutions, while the Dilated TCN, adapted from the WaveNet model, uses a deep series of dilated convolutions. The input to a TCN is a set of video features, such as the output of a spatial or spatiotemporal CNN, for each frame of a given video. Let X_t ∈ R^{F_0} be the input feature vector of length F_0 at time step t, for 1 ≤ t ≤ T. Note that the number of time steps T may differ for each video sequence. The action label for each frame is given by the vector Y_t ∈ {0, 1}^C, where C is the number of classes, with the true class set to 1 and all others set to 0.
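To make the shapes concrete (the dimensions below are made-up examples, not values from the paper), a TCN input is an F_0 × T feature matrix and the targets are one-hot C × T frame labels:

```python
import numpy as np

F0, C, T = 128, 10, 300             # feature dim, number of classes, frames (illustrative)

X = np.random.randn(F0, T)          # column X[:, t] is the frame-t feature vector X_t
frame_classes = np.random.randint(0, C, size=T)

Y = np.zeros((C, T))                # column Y[:, t] is the one-hot label vector Y_t
Y[frame_classes, np.arange(T)] = 1

assert (Y.sum(axis=0) == 1).all()   # exactly one active class per frame
```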

 

3.1 Encoder-Decoder TCN

Our encoder-decoder framework is depicted in Figure 1. The encoder consists of L layers, denoted by E^(l) ∈ R^{F_l × T_l}, where F_l is the number of convolutional filters in the l-th layer and T_l is the number of corresponding time steps. Each layer consists of temporal convolutions, a non-linear activation function, and max pooling across time. We define the collection of filters in each layer as W = {W^(i)}_{i=1}^{F_l} for W^(i) ∈ R^{d × F_{l−1}}, with a corresponding bias vector b ∈ R^{F_l}. Given the signal from the previous layer, E^(l−1), we compute activations E^(l) with

E^(l) = f(W ∗ E^(l−1) + b)

where f(·) is the activation function and ∗ is the ("same") convolution operator. We compare activations in Section 4.4 and find Normalized Rectified Linear Units perform best. After each activation function we max pool with width 2 across time, so T_l = T_{l−1}/2. Pooling enables us to efficiently compute activations over long temporal windows.

Our encoder-decoder framework is depicted in Figure 1. The encoder consists of L layers, denoted E^(l) ∈ R^{F_l × T_l}, where F_l is the number of convolutional filters in the l-th layer and T_l is the corresponding number of time steps. Each layer consists of temporal convolutions, a non-linear activation function, and max pooling across time. The collection of filters in each layer is defined as W = {W^(i)}_{i=1}^{F_l}, with W^(i) ∈ R^{d × F_{l−1}} and a corresponding bias vector b ∈ R^{F_l}. Given the signal E^(l−1) from the previous layer, the activations E^(l) are computed as above.

Here f(·) is the activation function and ∗ is the ("same") convolution operator. Activations are compared in Section 4.4, where the normalized rectified linear unit (ReLU) performs best. After each activation function, max pooling with width 2 is applied across time, so T_l = T_{l−1}/2. Pooling lets us efficiently compute activations over long temporal windows.
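A minimal sketch of one encoder layer in this spirit, assuming PyTorch, with a plain ReLU standing in for the paper's normalized ReLU (the filter length and channel sizes below are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One ED-TCN encoder step: temporal conv -> activation -> max pool (width 2)."""
    def __init__(self, in_channels, out_channels, kernel_size=25):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)   # "same" convolution across time
        self.pool = nn.MaxPool1d(2)                        # halves the length: T_l = T_{l-1}/2

    def forward(self, x):                 # x: (batch, F_{l-1}, T_{l-1})
        x = torch.relu(self.conv(x))      # plain ReLU in place of normalized ReLU
        return self.pool(x)               # (batch, F_l, T_{l-1} // 2)

x = torch.randn(1, 128, 256)              # e.g. 128-dim frame features over 256 frames
print(EncoderLayer(128, 64)(x).shape)     # torch.Size([1, 64, 128])
```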

 

Our decoder is similar to the encoder, except that upsampling is used instead of pooling and the order of the operations is now upsample, convolve, and apply the activation function. Upsampling is performed by simply repeating each entry twice. The convolutional filters in the decoder distribute the activations from the condensed layers in the middle to the action predictions at the top. Experimentally, these convolutions provide a large improvement in performance and appear to capture pairwise transitions between actions. Each decoder layer is denoted by D^(l) ∈ R^{F_l × T_l} for l ∈ {L, ..., 1}. Note that these are indexed in reverse order compared to the encoder, so the filter count in the first encoder layer is the same as in the last decoder layer.

Our decoder is similar to the encoder, except that upsampling is used instead of pooling and the order of operations becomes: upsample, convolve, then apply the activation function. Upsampling is performed by simply repeating each entry twice. The convolutional filters in the decoder distribute the activations from the condensed middle layers to the action predictions at the top. Experimentally, these convolutions give a large improvement in performance and appear to capture pairwise transitions between actions. Each decoder layer is denoted D^(l) ∈ R^{F_l × T_l} for l ∈ {L, ..., 1}. Note that these are indexed in reverse order relative to the encoder, so the filter count in the first encoder layer equals that of the last decoder layer.
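A matching decoder-layer sketch under the same assumptions (PyTorch, illustrative sizes); repeating each time step twice corresponds to nearest-neighbor upsampling:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One ED-TCN decoder step: upsample (repeat x2) -> temporal conv -> activation."""
    def __init__(self, in_channels, out_channels, kernel_size=25):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")  # repeat each entry twice
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)              # "same" convolution

    def forward(self, x):                  # x: (batch, F_l, T_l)
        x = self.upsample(x)               # (batch, F_l, 2 * T_l)
        return torch.relu(self.conv(x))    # (batch, F_{l-1}, 2 * T_l)

print(DecoderLayer(64, 128)(torch.randn(1, 64, 128)).shape)  # torch.Size([1, 128, 256])
```

In the full ED-TCN, a per-frame classification layer (a softmax over the C classes) sits on top of the final decoder output to produce the frame-wise action predictions.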

 
