ScreenWriter: Automatic Screenplay Generation and Movie Summarisation

Louis Mahon, Mirella Lapata

TL;DR

This work proposes the task of automatic screenplay generation, and a method, ScreenWriter, that operates only on video and produces output which includes dialogue, speaker names, scene breaks, and visual descriptions, and introduces a novel algorithm to segment the video into scenes based on the sequence of visual vectors.

Abstract

The proliferation of creative video content has driven demand for textual descriptions or summaries that allow users to recall key plot points or get an overview without watching. The volume of movie content and speed of turnover motivates automatic summarisation, which is nevertheless challenging, as it requires identifying character intentions and very long-range temporal dependencies. The few existing methods attempting this task rely heavily on textual screenplays as input, greatly limiting their applicability. In this work, we propose the task of automatic screenplay generation, and a method, ScreenWriter, that operates only on video and produces output which includes dialogue, speaker names, scene breaks, and visual descriptions. ScreenWriter introduces a novel algorithm to segment the video into scenes based on the sequence of visual vectors, and a novel method for the challenging problem of determining character names, based on a database of actors' faces. We further demonstrate how these automatic screenplays can be used to generate plot synopses with a hierarchical summarisation method based on scene breaks. We test the quality of the final summaries on the recent MovieSum dataset, which we augment with videos, and show that they are superior to a number of comparison models which assume access to gold-standard screenplays.

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 2 algorithms.

Figures (2)

  • Figure 1: Computing the cost of assigning the character Clarice Starling (Jodie Foster) to three different scenes of The Silence of the Lambs (1991). After computing the cost of assigning a character to each scene, we compute the cost of assigning that character to a speaker ID as the mean of the costs of assigning them to all scenes in which that speaker ID appears.
  • Figure 2: Example of a snippet from the generated screenplay for The Silence of the Lambs (1991). The left side shows the transcribed text, with the names inferred by our method. The right side shows the visual captions, along with the keyframe from which they were derived. The horizontal line shows the inferred scene break.

We experimentally show that the generated screenplay can be used as a basis for movie summarisation. We adopt a hierarchical summarisation approach \cite{pang-etal-2023-long,chang2023booookscore}, as it has been shown to be particularly suited to long inputs that exceed the context window size of large language models, and in our case can leverage the organization of the content into scenes. We thus first summarise the transcript dialogue of each scene, and then fuse the resulting sequence of summaries, along with the visual information for each scene, into a single summary for the entire movie (see Figure \ref{fig:method}). Our summariser is implemented using a widely-used open-source LLM library \cite{dubey2024llama3herdmodels} with zero-shot prompting.

Keyframes are extracted using FFMPEG's scene detect filter. The full command is given in Appendix \ref{app:ffmpeg}. Visual features are extracted from keyframes using CLIP \cite{pmlr-v139-radford21a}. The precise model used in the experiments of Section \ref{sec:results} is 'CLIP-ViT-g-14-laion2B-s12B-b42K' from https://github.com/mlfoundations/open_clip. The speaker diarization model is WhisperX \cite{bain23_interspeech}, an extension of OpenAI's Whisper model which can perform speaker diarization and accurate utterance timestamping. For visual descriptions, we use Kosmos 2 \cite{peng2023kosmos}, which has been pretrained on several multimodal corpora as well as grounded image-text pairs (spans from the text are associated with image regions) and instruction-tuned on various vision-language instruction datasets. Our summarisation model is built on top of Llama 3.1 70B \cite{touvron2023llama}.
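The two-level summarisation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `llm` is a hypothetical stand-in for any text-completion callable, the `Scene` container and both function names are invented for the example, the prompt wording is abbreviated, and the way visual captions are appended to the fusion prompt is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scene:
    """Hypothetical container for one scene of the generated screenplay."""
    dialogue: str                                       # transcript lines with inferred names
    captions: List[str] = field(default_factory=list)   # visual descriptions of keyframes


def summarise_scene(scene: Scene, index: int, title: str,
                    llm: Callable[[str], str]) -> str:
    """First level: summarise the dialogue of a single scene."""
    prompt = (
        f"Here is the dialogue from scene {index} of the movie {title}: "
        f"{scene.dialogue}\nPlease describe its main events in bullet points."
    )
    return llm(prompt)


def summarise_movie(scenes: List[Scene], title: str,
                    llm: Callable[[str], str]) -> str:
    """Second level: fuse scene summaries and visual captions into one synopsis."""
    parts = []
    for i, scene in enumerate(scenes, start=1):
        summary = summarise_scene(scene, i, title, llm)
        visuals = " ".join(scene.captions)
        # Assumed layout: each scene contributes its summary plus its captions.
        parts.append(f"Scene {i} summary: {summary}\nScene {i} visuals: {visuals}")
    fuse_prompt = (
        "Here is a sequence of summaries of each scene of a movie.\n"
        + "\n".join(parts)
        + f"\nCombine them into a plot synopsis of the movie {title}."
    )
    return llm(fuse_prompt)
```

With N scenes this makes N + 1 calls to the model, and each call sees only one scene or the much shorter sequence of scene summaries, which is what keeps every prompt within the context window.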
We use short simple prompts for Llama and Kosmos, which are given in full in Appendix \ref{app:prompts}. We instruct summaries to be a maximum of 635 words (the mean in our test set), and truncate summaries to 650 words if they are longer.

We take screenplays (for comparison models and some testing, see below) and gold summaries from the recently released MovieSum dataset \cite{saxena2024moviesum}. For the 200 movies in the test set, we sought to purchase the corresponding videos to use as input to our model, and were able to find videos for 175 of the 200 test set instances. These movies span multiple fiction genres: drama, action, thriller, comedy, horror, etc. They have an average run time of 118min (range 84--228), with release dates ranging from 1950 to 2023. The gold summaries average 635 words in length. The mean number of scenes in the gold script is 151. Because all stages of our method are zero-shot, we do not need video inputs for the training set.

Automated evaluation metrics are crucial for our task and for related long-form applications where human evaluation is extremely labor-intensive, costly, and difficult to design \cite{krishna-etal-2023-longeval}. As there is no single agreed-upon metric for automatic summarisation, we report several complementary metrics aimed at assessing different aspects of summary quality.
  • Rouge \cite{lin-2004-rouge} assesses informativeness against the reference summaries.
  • Prisma \cite{mahon-lapata-2024-modular} measures factual precision and recall with respect to the gold summary; we use GPT4-turbo for both the fact extraction and evaluation stages.
  • SummaC \cite{Laban2022SummaCRN} uses NLI to measure consistency between the input document (gold screenplay) and generated summary; we use the SummaCConv version with 50 evenly-spaced bins.
  • AlignScore \cite{zha-etal-2023-alignscore} scores the 'informational alignment' between the source (gold screenplay) and the generated summary; we use the base-model checkpoint provided by the authors, and the recommended 'nli' setting with sentence chunk splitting.

For both AlignScore and Prisma we score duplicated information as incorrect, to penalize LLM outputs that repeat the same sentences over and over. To measure the accuracy of our scene detection method, we use standard partition quality metrics: cluster accuracy, adjusted Rand index and normalized mutual information, as defined in \cite{mahon2024hard}.

Table: Cluster accuracy ('acc'), adjusted Rand index ('ari') and normalized mutual information ('nmi') of our predicted scene breaks, compared to dividing uniformly into 60, 75 and 90 scenes and into the true number of scenes ('unif-oracle').

              acc     ari     nmi
unif-60       0.461   0.278   0.720
unif-75       0.441   0.244   0.712
unif-90       0.430   0.216   0.704
unif-oracle   0.436   0.228   0.710
ours          0.564   0.375   0.746

To measure the accuracy of our scene segmentation method in isolation, we compare the partitions it produces to those arising from the ground truth scene breaks given in the gold screenplay. We perform dynamic time warping \cite{myers1981comparative} on the dialogue lines in the gold screenplay and the timestamped utterances from the automatic transcript, in order to produce timestamps for the ground truth scene breaks.
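The alignment step just described can be sketched as below. It is a minimal illustration under assumptions, not the authors' code: the text cost (one minus a `difflib` similarity ratio), the function names, and the choice to timestamp a break with the first utterance aligned to the scene's opening dialogue line are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple


def text_cost(a: str, b: str) -> float:
    """Dissimilarity in [0, 1]; an assumed stand-in for the real matching cost."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dtw_path(xs: Sequence[str], ys: Sequence[str],
             cost: Callable[[str, str], float] = text_cost) -> List[Tuple[int, int]]:
    """Classic dynamic time warping over two non-empty sequences.

    Returns the minimum-cost monotone alignment as (index_in_xs, index_in_ys) pairs.
    """
    n, m = len(xs), len(ys)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost(xs[i - 1], ys[j - 1]) + best_prev
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    return path[::-1]


def gold_break_times(gold_lines: Sequence[str],
                     scene_start_lines: Sequence[int],
                     utterances: Sequence[Tuple[float, str]]) -> List[float]:
    """Timestamp each gold scene break via the utterance aligned to its first line.

    gold_lines: dialogue lines of the gold screenplay, in order.
    scene_start_lines: index into gold_lines of the first line of each new scene.
    utterances: (start_time_seconds, text) pairs from the automatic transcript.
    """
    path = dtw_path(gold_lines, [text for _, text in utterances])
    first_match = {}
    for gold_idx, utt_idx in path:
        first_match.setdefault(gold_idx, utt_idx)  # keep the earliest aligned utterance
    return [utterances[first_match[g]][0] for g in scene_start_lines]
```

The quadratic table is affordable here because both sequences contain at most a few thousand dialogue lines per movie; a banded variant would be a natural refinement for longer inputs.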
A naive metric would be the distance between the $n$th predicted break and the $n$th ground truth break, but this is inappropriate because a model that failed to predict the very first break, but got every other one exactly right, would then get a low score, since every predicted break would be compared against the wrong ground truth break. Instead we treat scene break detection as a partition problem. Specifically, we consider the video as divided into 0.1s segments, where two segments are in the same element of the partition if and only if there is no scene break between them. Table \ref{tab:scene-segmentation-results} shows the accuracy of the predicted scene breaks, using the standard clustering metrics as defined above, compared to various versions of splitting the input into uniformly-sized scenes: splitting into 60, 75 and 90 scenes (unif-60, unif-75 and unif-90, respectively), and splitting into the ground truth number of scenes for each movie (unif-oracle). Note that our method places no restrictions on the possible number of scenes: it is free to predict only one scene, or as many scenes as there are 0.1s segments ($\sim$60,000). In practice, it predicts between 25 and 107 scenes across the test set. We observe that our minimum description length inspired algorithm is superior to baselines based on uniform segmentation, even when the number of scenes is known in advance (see row unif-oracle). Many occasions where the model fails to predict a scene boundary occur when the scenes on either side appear visually similar. For example, in The Witch (2015), many scenes take place with the same background and characters, which presents only minor visual differences for the algorithm to detect.
This suggests that many of the errors in our scene detection arise from insufficient signal in the visual feature vectors, rather than from the algorithm itself, and that future work which augments these vectors with, e.g., elements from dialogue or audio, would improve accuracy.

Table: Accuracy of our assigned character names compared to assigning names randomly ('random') and assigning the most common name, i.e., the main character, to all lines ('most common'). Scores are averaged both across all movies ('acc movie-wise') and across all script lines in all movies ('acc line-wise').

                 ours    most common   random
acc movie-wise   61.12   19.35         2.97
acc line-wise    65.72   19.62         2.61

Table \ref{tab:name-assignment-results} presents an evaluation of our name assignment algorithm against two baselines: assigning names randomly, and assigning all IDs the most common name, i.e., the main character. As can be seen, though there is room for improvement, our approach is more accurate by a wide margin. Multiple factors contribute to the errors in name assignment: some incorrect faces being retrieved from the database (though this is rare due to our clique-based filtering procedure); inaccuracies in the face feature vectors, such that the same person can sometimes receive dissimilar vectors in different contexts while different people can sometimes receive similar vectors; and the speaker diarization performed by WhisperX, which sometimes gives the same character a different speaker ID, or gives the same speaker ID to two different characters. This last error is especially problematic because it makes it impossible for the assignment algorithm to find a solution with zero mistakes. We expect that future improvements in speaker diarization and face verification will reduce the prevalence of these errors.
Indeed, this is one of the advantages of a modular framework: improvements in specific areas can be incorporated into the framework without needing to change the other modules.

In Table \ref{tab:main-results}, we evaluate the summaries generated by our method. We benchmark against three baselines: 'name-only prompt' uses the parametric knowledge of the LLM without any content input, e.g., the prompt is 'Summarize the movie The Silence of the Lambs'; 'full script' uses the entire gold screenplay as input in the prompt; and for 'whisperX' the input is the WhisperX transcript. We also compare to two existing models: 'Otter AI' \cite{li2023otter}, an end-to-end video description model based on video-llama2; and the modular model of \cite{mahon-lapata-2024-modular}, which takes videos and gold screenplays as input (described in Section \ref{sec:related-work}). For Otter, we divide the input video into 3min chunks, and combine the model description of each chunk. Our summaries obtain the highest scores across all metrics. The improvement is largest for the fact-based metrics of Prisma (composed of fact-prec and fact-rec) and AlignScore. The existing models, Otter AI and multimodal modular, especially struggle with such metrics. We find that Otter AI is mostly able to capture surface-level detail, with descriptions such as "a woman gets out of a car and goes into a building", but is unable to construct a narrative such as "a woman drives to the bank to deposit the money", so ends up capturing very little of the plot. The low scores of multimodal modular, on the other hand, are largely due to the older, smaller backbone model (BART, \cite{lewis2020bart}), which often becomes decoupled from the input and produces unrelated output, highlighting the importance of incorporating current LLMs into video summarisation models.
Giving only the movie name in the prompt produces reasonably high-quality summaries, confirming that Llama 3.1 has significant information about these movies stored parametrically. However, these summaries are short, and when asked for a longer summary, the model repeats the same information over and over. Surprisingly, giving the full gold screenplay as input does not produce better summaries than our method or than some other baselines. This shows there is still difficulty in summarising very long text inputs. We provide example summary output in Appendix \ref{sec:example_summaries}.

Table \ref{tab:ablation-results} shows the results of removing the main components of our model. In 'w/o names', we omit replacing speaker IDs with character names. This causes summary quality to drop on every metric except r2, which shows that not only is our name assignment more accurate than baseline methods (see Table \ref{tab:name-assignment-results}), but it is sufficiently accurate to lead to improved downstream summaries. In 'w/o scene breaks', we feed the entire ScreenWriter screenplay to Llama 3.1 in one pass, instead of our hierarchical approach of first summarising scenes and then fusing these into a final summary. The drop in summary performance in this setting shows the effectiveness of the hierarchical summarisation method enabled by the scene breaks in ScreenWriter. In 'unif-breaks', we still adopt the hierarchical summarisation method, but instead of using our scene breaks, split the input into uniform chunks of 250 tokens, which is the mean scene length from our predicted segmentation. This setting also reduces summary quality, which shows that not only is our scene segmentation more accurate than baseline methods (Table \ref{tab:scene-segmentation-results}), but it is sufficiently accurate to lead to improved downstream summaries.

Table: Summarisation results on MovieSum. Best results are in bold, second best are italicised.

                   r1      r2      rl-sum   fact-prec   fact-rec   Prisma   alignscore   summac
name-only prompt   43.46   9.53    41.17    50.40       43.04      44.16    53.11        26.57
full script        42.39   9.32    39.94    48.77       52.73      49.05    68.59        25.83
whisperX           42.37   9.22    39.94    46.73       53.65      48.00    68.57        25.86
Otter AI           27.93   3.06    26.73    11.67       8.95       5.18     45.90        24.37
multi-modular      20.59   2.79    19.97    23.16       23.19      19.28    46.32        26.97
ours               46.48   10.32   44.50    55.24       54.77      53.57    72.76        27.24

Table: Ablation studies on summarisation results on MovieSum. Setting 'w/o names' does not replace speaker IDs with character names using our assignment method. Setting 'w/o scene breaks' summarises the entire screenplay in one pass, rather than splitting it into scenes using our algorithm and summarising each separately.

                   r1      r2      rl-sum   fact-prec   fact-rec   Prisma   alignscore   summac
w/o names          45.46   10.43   43.40    49.93       53.64      49.00    63.67        26.45
w/o scene breaks   38.87   8.45    36.82    48.32       51.79      48.11    71.95        26.31
unif-breaks        38.87   8.45    36.82    46.58       50.69      48.11    57.62        25.73
ours               46.48   10.32   44.50    55.24       54.77      53.57    72.76        27.24

In this work, we proposed the task of generating automatic screenplays for movies from only video and audio input. Our model, ScreenWriter, produces screenplays automatically (including dialogue, speaker names, scene breaks and visual descriptions) based on two novel algorithms: one for segmenting the video into scenes, based on the minimum description length principle and dynamic programming for search, and one for assigning character names to dialogue utterances using a database of names and actor faces. Experimental results show that the output of ScreenWriter together with a hierarchical summarisation method can be used to generate movie plot synopses from only video and audio input. To the best of our knowledge, this is the first attempt to address this task. In the future, we would like to extend ScreenWriter's capabilities to other types of long videos, including documentaries, current affairs television programmes, and sports games.

Copyright is a concern when working with movies.
We respected this by purchasing all the movies used for testing.

We specify the novel algorithms in detail in Section \ref{sec:screenwriter}. We list the specific models used for our method and for comparison models in Section \ref{sec:experimental-setting}. We specify prompts used in Appendix \ref{app:prompts}. Additionally, we have included all the code for our methods and experimental results in the supplementary material.

To select keyframes, we use [command missing]. This extracts all keyframes into files 0001.jpg, 0002.jpg, etc., in the current working directory.

Below we present the various prompts we employ for obtaining scene descriptions, and performing hierarchical summarisation. Note that Kosmos is a text completion model, so this prompt just serves as the first part of the sentence, which we then remove afterward.

A shot from a movie in which

Here is the dialogue from scene $<$scene-number$>$ of the movie $<$movie-title$>$: $<$scene-dialogue-with-names$>$. Please describe its main events in bullet points. Don't include information from outside this scene. Do not answer in progressive aspect, i.e., don't use -ing verbs or "is being". In this scene, here are a few main events:

Here is a sequence of summaries of each scene of a movie. $<$concatenated-dialogue-summaries$>$ Combine them into a plot synopsis of no more than 635 words. Be sure to include information from all scenes, especially those at the end, don't focus too much on early scenes. Discuss only plot events, no analysis or discussion of themes and characters. Based on the information provided, here is a plot synopsis of the move $<$movie-title$>$:

Below we show the prompts used to obtain movie summaries for the various baselines and comparison systems discussed in Section \ref{sec:results}. The 'name-only prompt' uses the parametric knowledge of the LLM without any specific content input.
The 'full script' prompt uses the entire gold screenplay as input, and 'WhisperX' just the audio transcript without name assignment or scene breaks.

Summarize the plot of the movie $<$movie-title$>$ in about 650 words. Do not write the summary in progressive aspect, i.e., don't use -ing verbs or "is being". Focus only on the plot events, no analysis or discussion of themes and characters.

Based on the following script: $<$gold-screenplay$>$ summarize the movie $<$movie-title$>$. Do not write the summary in progressive aspect, i.e., don't use -ing verbs or "is being". Focus only on the plot events, no analysis or discussion of themes and characters.

Based on the following transcript: $<$whisper-transcript$>$ summarize the movie $<$movie-title$>$. Do not write the summary in progressive aspect, i.e., don't use -ing verbs or "is being". Focus only on the plot events, no analysis or discussion of themes and characters.

In the following we show example summaries generated by our model and comparison systems for the movie Oppenheimer (2023). Incorrect or undesirable text is shown in red and repeated information is highlighted in gray. For comparison, we also include the gold summary from the MovieSum test set.

The movie Oppenheimer begins with J. Robert Oppenheimer testifying before the Security Board, explaining that the derogatory information against him must be understood in the context of his life and work. Lewis Strauss and Gordon Gray discuss Strauss's upcoming Senate confirmation hearing for a cabinet position, and Gray advises Strauss to answer honestly about his past conflicts with Oppenheimer. The story then flashes back to Oppenheimer's early life, where he meets Niels Bohr and is introduced to the world of physics. Oppenheimer becomes involved with left-wing groups and is questioned about his communist associations. He meets with Lewis Strauss, who is trying to recruit him to run the Institute for Advanced Study at Princeton.
As the story progresses, Oppenheimer becomes involved in the development of the atomic bomb and is appointed as the director of the Manhattan Engineer District. He meets with Colonel Groves and Lieutenant Colonel Nichols, who express concerns about his suitability for the job due to his suspected communist sympathies and unstable personality. Despite these concerns, Oppenheimer convinces the team to work on the project, and they begin to develop the atomic bomb. The team faces numerous challenges, including the need for vast resources and the risk of inefficiency and poor security. Oppenheimer mentions that the Germans, led by Werner Heisenberg, have a 12- to 18-month head start on developing an atomic bomb. As the project progresses, Oppenheimer becomes increasingly concerned about the implications of the atomic bomb and the potential for an arms race with the Soviets. He expresses his concerns to Lewis Strauss, who is skeptical of Oppenheimer's views. The team eventually succeeds in developing the atomic bomb, and it is dropped on Hiroshima. The bomb's power is described as equivalent to 20,000 tons of DNC, and General Groves is congratulated on the success of the mission.

However, Oppenheimer's involvement in the project and his left-wing past eventually catch up with him. Lewis Strauss, who has a history of conflict with scientists, particularly Oppenheimer, begins to systematically destroy Oppenheimer's credibility by denying him his Q clearance through a bureaucratic procedure. Oppenheimer is given the chance to appeal, and a board is appointed to hear the case, with Roger Robb as the prosecutor. The hearing becomes a trial, with numerous witnesses testifying against Oppenheimer. Lewis Strauss questions Oppenheimer about his involvement in the development of the hydrogen bomb and his moral qualms about it. The Gray Board eventually announces its decision to deny J. Robert Oppenheimer's security clearance renewal.
Lewis Strauss is informed that the decision was not unanimous and that there were unexpected holdouts, including Senator John F. Kennedy. Strauss reacts angrily to the news, feeling that Oppenheimer has turned the scientists against him and that he is being unfairly targeted. The movie ends with Oppenheimer calm and collected in the face of the board's decision, while Strauss is still fuming and Gordon Gray tries to reason with him. The final scenes show Oppenheimer addressing a group of people, and a truck driving through a desert town, symbolizing Oppenheimer's continued involvement in the world of physics despite the challenges he faced. Overall, the movie Oppenheimer is a story about the development of the atomic bomb and the challenges faced by J. Robert Oppenheimer, a brilliant physicist who was involved in the project. The movie explores the implications of the atomic bomb and the potential for an arms race with the Soviets, as well as the personal costs faced by Oppenheimer due to his left-wing past and his involvement in the project. The movie ends with Oppenheimer calm and collected, despite the challenges he faced, and symbolizes his continued involvement in the world of physics.

The movie Oppenheimer (2023) tells the story of J. Robert Oppenheimer, the director of the Manhattan Project, which developed the atomic bomb during World War II. The film begins with Oppenheimer's appointment as the director of the project in 1942. He assembles a team of scientists, including Enrico Fermi, Ernest Lawrence, and Richard Feynman, to work on the top-secret project at Los Alamos, New Mexico. Oppenheimer's team faces numerous challenges, including the difficulty of creating a nuclear reaction and the need to develop a delivery system for the bomb. The team works tirelessly to overcome these obstacles, and they eventually succeed in creating a functional atomic bomb.
As the project nears completion, Oppenheimer and his team are faced with the moral implications of their creation. They are aware that the bomb will cause widespread destruction and loss of life, and some team members express reservations about its use. The film depicts the first nuclear test, codenamed "Trinity," which takes place on July 16, 1945, in the New Mexico desert. The test is a success, and the team is ecstatic. However, Oppenheimer is visibly shaken by the experience, quoting the Bhagavad Gita, "I am become Death, destroyer of worlds." The film then jumps forward in time to 1954, when Oppenheimer is called to testify before the Atomic Energy Commission. He is questioned about his loyalty to the United States and his associations with communist sympathizers. Oppenheimer's past comes back to haunt him, and he is eventually stripped of his security clearance. The film ends with Oppenheimer's reflection on the devastating consequences of the atomic bomb and his role in its creation. He is shown to be a complex and troubled individual, grappling with the moral implications of his work. Note: The above text is a summary of the movie Oppenheimer (2023) and not the actual movie script. Here is a rewritten version of the summary in the past tense: The movie Oppenheimer (2023) tells the story of J. Robert Oppenheimer, the director of the Manhattan Project, which developed the atomic bomb during World War II. The film begins with Oppenheimer's appointment as the director of the project in 1942. He assembled a team of scientists, including Enrico Fermi, Ernest Lawrence, and Richard Feynman, to work on the top-secret project at Los Alamos, New Mexico. Oppenheimer's team faced numerous challenges, including the difficulty of creating a nuclear reaction and the need to develop a delivery system for the bomb. The team worked tirelessly to overcome these obstacles, and they eventually succeeded in creating a functional atomic bomb.
As the project neared completion, Oppenheimer and his team were faced with the moral implications of their creation. They were aware that the bomb would cause widespread destruction and loss of life, and some team members expressed reservations about its use. The film depicted the first nuclear test, codenamed "Trinity," which took place on July 16, 1945, in the New Mexico desert.

The movie Oppenheimer (2023) tells the story of J. Robert Oppenheimer, a brilliant physicist who played a crucial role in the development of the atomic bomb during World War II. The film begins with Oppenheimer's security clearance hearing, where he is questioned about his past associations with communists and his opposition to the development of the hydrogen bomb. The story then flashes back to Oppenheimer's early days as a professor at the University of California, Berkeley, where he meets his future wife, Kitty, and becomes friends with other prominent physicists, including Enrico Fermi and Ernest Lawrence. As World War II breaks out, Oppenheimer is recruited to lead the Manhattan Project, a secret research and development project that produces the atomic bomb. He works closely with other scientists, including Edward Teller and Leslie Groves, to develop the bomb, which is eventually dropped on Hiroshima and Nagasaki, killing hundreds of thousands of people. After the war, Oppenheimer becomes a vocal critic of the development of the hydrogen bomb, which he believes will lead to a nuclear arms race and increase the risk of nuclear war. This puts him at odds with Teller and other scientists who support the development of the hydrogen bomb. The film then cuts back to the security clearance hearing, where Oppenheimer is questioned about his past associations and his opposition to the hydrogen bomb. He is ultimately denied his security clearance, which effectively ends his career as a government advisor.
The film ends with Oppenheimer attending a ceremony at the White House, where he is awarded a medal for his contributions to science. However, the ceremony is bittersweet, as Oppenheimer is aware that his work on the atomic bomb has had devastating consequences, and that he has been punished for speaking out against the development of the hydrogen bomb. The film ends with a sense of irony and tragedy, as Oppenheimer is honored for his contributions to science, while also being aware of the devastating consequences of his work. The final shot is of Oppenheimer visualizing the expanding nuclear arsenals of the world, with the sound of feet stamping in the background, symbolizing the impending doom of the world due to nuclear war. The credits roll as the sound of the feet stamping continues to build up, leaving the audience with a haunting sense of the devastating consequences of Oppenheimer's work. The final shot is of Oppenheimer visualizing the expanding nuclear arsenals of the world, with the sound of feet stamping in the background, symbolizing the impending doom of the world due to nuclear war. The credits roll as the sound of the feet stamping continues to build up, leaving the audience with a haunting sense of the devastating consequences of Oppenheimer's work. The final shot is of Oppenheimer visualizing the expanding nuclear arsenals of the world, with the sound of feet stamping in the background, symbolizing the impending doom of the world due to nuclear war. The credits roll as the sound of the feet stamping continues to build up, leaving the audience with a haunting sense of the devastating consequences of Oppenheimer's work. The final shot is of Oppenheimer visualizing the expanding nuclear of the world.

In 1926, 22-year-old doctoral student J. Robert Oppenheimer grapples with anxiety and homesickness while studying under experimental physicist Patrick Blackett at the Cavendish Laboratory in the University of Cambridge.
Upset with Blackett's attitude, Oppenheimer leaves him a poisoned apple but later retrieves it. Visiting scientist Niels Bohr advises Oppenheimer to study theoretical physics at the University of Göttingen instead. Oppenheimer completes his PhD there and meets fellow scientist Isidor Isaac Rabi. They later meet theoretical physicist Werner Heisenberg in Switzerland. Wanting to expand quantum physics research in the United States, Oppenheimer begins teaching at the University of California, Berkeley and the California Institute of Technology. He marries Katherine "Kitty" Puening, a biologist and ex-communist, and has an intermittent affair with Jean Tatlock, a troubled communist who later commits suicide. In December 1938, nuclear fission is discovered, which Oppenheimer realizes could be weaponized. In 1942, during World War II, U.S. Army Colonel Leslie Groves recruits Oppenheimer as director of the Manhattan Project to develop an atomic bomb. Oppenheimer, who is Jewish, is mainly concerned that the German nuclear research program, led by Heisenberg, might yield a fission bomb for the Nazis. He assembles a team consisting of Rabi, Hans Bethe and Edward Teller at the Los Alamos Laboratory, and also collaborating with scientists Enrico Fermi, Leo Szilard and David L. Hill at the University of Chicago. Teller's calculations reveal an atomic detonation could trigger a catastrophic chain reaction that ignites the atmosphere. After consulting with Albert Einstein, Oppenheimer concludes the chances are acceptably low. Teller attempts to leave the project after his proposal to construct a hydrogen bomb is rejected, but Oppenheimer convinces him to stay. After Germany's surrender in 1945, some Project scientists question the bomb's relevance; Oppenheimer believes it would end the ongoing Pacific War and save Allied lives. The Trinity test is successful, and President Harry S. Truman orders the atomic bombings of Hiroshima and Nagasaki, resulting in Japan's surrender. 
Though publicly praised, Oppenheimer is haunted by the mass destruction and fatalities. After expressing his personal guilt to Truman, the president berates Oppenheimer and dismisses his urging to cease further atomic development. As an advisor to the United States Atomic Energy Commission (AEC), Oppenheimer's stance generates controversy, while Teller's hydrogen bomb receives renewed interest amidst the burgeoning Cold War. AEC Chairman Lewis Strauss resents Oppenheimer for publicly dismissing his concerns about exporting radioisotopes and for recommending negotiations with the Soviet Union after they successfully detonated their own bomb. He also believes that Oppenheimer denigrated him during a conversation Oppenheimer had with Einstein in 1947. In 1954, wanting to eliminate Oppenheimer's political influence, Strauss secretly orchestrates a private security hearing before a Personnel Security Board concerning Oppenheimer's Q clearance. However, it becomes clear that the hearing has a predetermined outcome. Oppenheimer's past communist ties are exploited, and Groves' and other associates' testimony is twisted against him. Teller testifies that he lacks confidence in Oppenheimer and recommends revocation. The board revokes Oppenheimer's clearance, damaging his public image and limiting his influence on nuclear policy. In 1959, during Strauss' Senate confirmation hearing for Secretary of Commerce, Hill testifies about Strauss' personal motives in engineering Oppenheimer's downfall, resulting in his nomination being voted down. In 1963, President Lyndon B. Johnson presents Oppenheimer with the Enrico Fermi Award as a gesture of political rehabilitation. A flashback reveals Oppenheimer and Einstein's 1947 conversation never mentioned Strauss. Oppenheimer instead expressed his belief that they had indeed started a chain reaction—a nuclear arms race—that would one day destroy the world.

The following is our model's predicted summary for the movie Oppenheimer (2023).

In a dystopian future England, a masked vigilante named V fights against a totalitarian government. The story begins with a radio show host, expressing his opinion about the former United States and its current state. He incites his listeners to take action against the United States by dumping its medical supplies into the water. Meanwhile, Evey is stopped by two Fingermen for being out past curfew. The Fingermen threaten and intimidate Evey, but V appears and kills them, saving Evey. V quotes Shakespeare as he kills the Fingermen. V introduces himself to Evey and others, explaining that he is a man in a mask with a mission of vengeance. A meeting is held to discuss the aftermath of the explosion at the Bailey building. The government is trying to spin the story of the explosion, calling it an "emergency demolition" and preparing experts to testify against the structural integrity of the old Bailey building. As the story unfolds, V broadcasts a message to London on an emergency channel, apologizing for the interruption. He explains that he is commemorating November 5th, a day that has been forgotten. V addresses Chancellor Suttler, accusing him of designing a system of oppression and censorship. V reveals that he destroyed the Old Bailey the previous night to remind the country of the importance of fairness, justice, and freedom. Evey begins to open up to V about their past, sharing a quote from their father about the difference between artists and politicians. Evey reveals that their brother was a student at St. Mary's and died, leading to their parents becoming political activists. V and Evey discuss their pasts, including Evey's desire to act and her mother's death. The story takes a turn when Evey discovers that V killed Lewis Prothero. V justifies his actions as a form of justice. Evey is shocked and upset by V's admission. V implies that he may kill more people like Prothero in the future. 
V and Evey discuss the morality of using violence for good and the concept of justice in a corrupt society. As the story nears its end, Evey confronts the person who tortured her in a prison cell. V explains that the torture was necessary to help Evey find the strength to live without fear. V shows Evey a letter written by a woman who died in a cell next to V's. Evey realizes that V's actions are motivated by a desire for revenge against those who wronged him and the woman who wrote the letter. In the final scenes, V and Evey meet at a location where music is playing. V asks Evey how she has avoided detection. Evey reveals that she has been using a fake ID. Chancellor Suttler plans to address the nation and warn protesters of severe consequences. Suttler's advisors discuss the possibility of the terrorist succeeding in his plan, but Suttler is confident it won't happen. V and Evey discuss their relationship and V's identity. V gives Evey a tour of the Underground, showing her the tracks that lead to Parliament. V gives Evey a gift: his home, books, gallery, and the train, leaving the choice of what to do with them to her. V explains that he will not pull the lever to blow up Parliament, as the choice belongs to the people who will shape the new world. V says goodbye to Evey and prepares to meet his maker, implying that he will sacrifice himself. Evey tries to persuade V to change his mind and leave with her, but he refuses. V confronts Chancellor Suttler in person and kills him. V removes his mask, revealing his face to the other characters, but not to the audience. In the final confrontation, V and Creedy engage in a tense standoff. Creedy taunts V, but V remains calm and confident. Creedy orders his men to kill V. V delivers a philosophical monologue about the power of ideas. V is mortally wounded, and Evey tries to stop his bleeding. 
V confesses his love to Evey.

@misc{dubey2024llama3herdmodels, title={The Llama 3 Herd of Models}, author={Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Amy Yang and Angela Fan and Anirudh Goyal and Anthony Hartshorn and Aobo Yang and Archi Mitra and Archie Sravankumar and Artem Korenev and Arthur Hinsvark and Arun Rao and Aston Zhang and Aurelien Rodriguez and Austen Gregerson and Ava Spataru and Baptiste Roziere and Bethany Biron and Binh Tang and Bobbie Chern and Charlotte Caucheteux and Chaya Nayak and Chloe Bi and Chris Marra and Chris McConnell and Christian Keller and Christophe Touret and Chunyang Wu and Corinne Wong and Cristian Canton Ferrer and Cyrus Nikolaidis and Damien Allonsius and Daniel Song and Danielle Pintz and Danny Livshits and David Esiobu and Dhruv Choudhary and Dhruv Mahajan and Diego Garcia-Olano and Diego Perino and Dieuwke Hupkes and Egor Lakomkin and Ehab AlBadawy and Elina Lobanova and Emily Dinan and Eric Michael Smith and Filip Radenovic and Frank Zhang and Gabriel Synnaeve and Gabrielle Lee and Georgia Lewis Anderson and Graeme Nail and Gregoire Mialon and Guan Pang and Guillem Cucurell and Hailey Nguyen and Hannah Korevaar and Hu Xu and Hugo Touvron and Iliyan Zarov and Imanol Arrieta Ibarra and 
Isabel Kloumann and Ishan Misra and Ivan Evtimov and Jade Copet and Jaewon Lee and Jan Geffert and Jana Vranes and Jason Park and Jay Mahadeokar and Jeet Shah and Jelmer van der Linde and Jennifer Billock and Jenny Hong and Jenya Lee and Jeremy Fu and Jianfeng Chi and Jianyu Huang and Jiawen Liu and Jie Wang and Jiecao Yu and Joanna Bitton and Joe Spisak and Jongsoo Park and Joseph Rocca and Joshua Johnstun and Joshua Saxe and Junteng Jia and Kalyan Vasuden Alwala and Kartikeya Upasani and Kate Plawiak and Ke Li and Kenneth Heafield and Kevin Stone and Khalid El-Arini and Krithika Iyer and Kshitiz Malik and Kuenley Chiu and Kunal Bhalla and Lauren Rantala-Yeary and Laurens van der Maaten and Lawrence Chen and Liang Tan and Liz Jenkins and Louis Martin and Lovish Madaan and Lubo Malo and Lukas Blecher and Lukas Landzaat and Luke de Oliveira and Madeline Muzzi and Mahesh Pasupuleti and Mannat Singh and Manohar Paluri and Marcin Kardas and Mathew Oldham and Mathieu Rita and Maya Pavlova and Melanie Kambadur and Mike Lewis and Min Si and Mitesh Kumar Singh and Mona Hassan and Naman Goyal and Narjes Torabi and Nikolay Bashlykov and Nikolay Bogoychev and Niladri Chatterji and Olivier Duchenne and Onur Çelebi and Patrick Alrassy and Pengchuan Zhang and Pengwei Li and Petar Vasic and Peter Weng and Prajjwal Bhargava and Pratik Dubal and Praveen Krishnan and Punit Singh Koura and Puxin Xu and Qing He and Qingxiao Dong and Ragavan Srinivasan and Raj Ganapathy and Ramon Calderer and Ricardo Silveira Cabral and Robert Stojnic and Roberta Raileanu and Rohit Girdhar and Rohit Patel and Romain Sauvestre and Ronnie Polidoro and Roshan Sumbaly and Ross Taylor and Ruan Silva and Rui Hou and Rui Wang and Saghar Hosseini and Sahana Chennabasappa and Sanjay Singh and Sean Bell and Seohyun Sonia Kim and Sergey Edunov and Shaoliang Nie and Sharan Narang and Sharath Raparthy and Sheng Shen and Shengye Wan and Shruti Bhosale and Shun Zhang and Simon Vandenhende and Soumya Batra and Spencer 
Whitman and Sten Sootla and Stephane Collot and Suchin Gururangan and Sydney Borodinsky and Tamar Herman and Tara Fowler and Tarek Sheasha and Thomas Georgiou and Thomas Scialom and Tobias Speckbacher and Todor Mihaylov and Tong Xiao and Ujjwal Karn and Vedanuj Goswami and Vibhor Gupta and Vignesh Ramanathan and Viktor Kerkez and Vincent Gonguet and Virginie Do and Vish Vogeti and Vladan Petrovic and Weiwei Chu and Wenhan Xiong and Wenyin Fu and Whitney Meers and Xavier Martinet and Xiaodong Wang and Xiaoqing Ellen Tan and Xinfeng Xie and Xuchao Jia and Xuewei Wang and Yaelle Goldschlag and Yashesh Gaur and Yasmine Babaei and Yi Wen and Yiwen Song and Yuchen Zhang and Yue Li and Yuning Mao and Zacharie Delpierre Coudert and Zheng Yan and Zhengxing Chen and Zoe Papakipos and Aaditya Singh and Aaron Grattafiori and Abha Jain and Adam Kelsey and Adam Shajnfeld and Adithya Gangidi and Adolfo Victoria and Ahuva Goldstand and Ajay Menon and Ajay Sharma and Alex Boesenberg and Alex Vaughan and Alexei Baevski and Allie Feinstein and Amanda Kallet and Amit Sangani and Anam Yunus and Andrei Lupu and Andres Alvarado and Andrew Caples and Andrew Gu and Andrew Ho and Andrew Poulton and Andrew Ryan and Ankit Ramchandani and Annie Franco and Aparajita Saraf and Arkabandhu Chowdhury and Ashley Gabriel and Ashwin Bharambe and Assaf Eisenman and Azadeh Yazdan and Beau James and Ben Maurer and Benjamin Leonhardi and Bernie Huang and Beth Loyd and Beto De Paola and Bhargavi Paranjape and Bing Liu and Bo Wu and Boyu Ni and Braden Hancock and Bram Wasti and Brandon Spence and Brani Stojkovic and Brian Gamido and Britt Montalvo and Carl Parker and Carly Burton and Catalina Mejia and Changhan Wang and Changkyu Kim and Chao Zhou and Chester Hu and Ching-Hsiang Chu and Chris Cai and Chris Tindal and Christoph Feichtenhofer and Damon Civin and Dana Beaty and Daniel Kreymer and Daniel Li and Danny Wyatt and David Adkins and David Xu and Davide Testuggine and Delia David and Devi Parikh and 
Diana Liskovich and Didem Foss and Dingkang Wang and Duc Le and Dustin Holland and Edward Dowling and Eissa Jamil and Elaine Montgomery and Eleonora Presani and Emily Hahn and Emily Wood and Erik Brinkman and Esteban Arcaute and Evan Dunbar and Evan Smothers and Fei Sun and Felix Kreuk and Feng Tian and Firat Ozgenel and Francesco Caggioni and Francisco Guzmán and Frank Kanayet and Frank Seide and Gabriela Medina Florez and Gabriella Schwarz and Gada Badeer and Georgia Swee and Gil Halpern and Govind Thattai and Grant Herman and Grigory Sizov and Guangyi Zhang and Guna Lakshminarayanan and Hamid Shojanazeri and Han Zou and Hannah Wang and Hanwen Zha and Haroun Habeeb and Harrison Rudolph and Helen Suk and Henry Aspegren and Hunter Goldman and Ibrahim Damlaj and Igor Molybog and Igor Tufanov and Irina-Elena Veliche and Itai Gat and Jake Weissman and James Geboski and James Kohli and Japhet Asher and Jean-Baptiste Gaya and Jeff Marcus and Jeff Tang and Jennifer Chan and Jenny Zhen and Jeremy Reizenstein and Jeremy Teboul and Jessica Zhong and Jian Jin and Jingyi Yang and Joe Cummings and Jon Carvill and Jon Shepard and Jonathan McPhie and Jonathan Torres and Josh Ginsburg and Junjie Wang and Kai Wu and Kam Hou U and Karan Saxena and Karthik Prasad and Kartikay Khandelwal and Katayoun Zand and Kathy Matosich and Kaushik Veeraraghavan and Kelly Michelena and Keqian Li and Kun Huang and Kunal Chawla and Kushal Lakhotia and Kyle Huang and Lailin Chen and Lakshya Garg and Lavender A and Leandro Silva and Lee Bell and Lei Zhang and Liangpeng Guo and Licheng Yu and Liron Moshkovich and Luca Wehrstedt and Madian Khabsa and Manav Avalani and Manish Bhatt and Maria Tsimpoukelli and Martynas Mankus and Matan Hasson and Matthew Lennie and Matthias Reso and Maxim Groshev and Maxim Naumov and Maya Lathi and Meghan Keneally and Michael L.
Seltzer and Michal Valko and Michelle Restrepo and Mihir Patel and Mik Vyatskov and Mikayel Samvelyan and Mike Clark and Mike Macey and Mike Wang and Miquel Jubert Hermoso and Mo Metanat and Mohammad Rastegari and Munish Bansal and Nandhini Santhanam and Natascha Parks and Natasha White and Navyata Bawa and Nayan Singhal and Nick Egebo and Nicolas Usunier and Nikolay Pavlovich Laptev and Ning Dong and Ning Zhang and Norman Cheng and Oleg Chernoguz and Olivia Hart and Omkar Salpekar and Ozlem Kalinli and Parkin Kent and Parth Parekh and Paul Saab and Pavan Balaji and Pedro Rittner and Philip Bontrager and Pierre Roux and Piotr Dollar and Polina Zvyagina and Prashant Ratanchandani and Pritish Yuvraj and Qian Liang and Rachad Alao and Rachel Rodriguez and Rafi Ayub and Raghotham Murthy and Raghu Nayani and Rahul Mitra and Raymond Li and Rebekkah Hogan and Robin Battey and Rocky Wang and Rohan Maheswari and Russ Howes and Ruty Rinott and Sai Jayesh Bondu and Samyak Datta and Sara Chugh and Sara Hunt and Sargun Dhillon and Sasha Sidorov and Satadru Pan and Saurabh Verma and Seiji Yamamoto and Sharadh Ramaswamy and Shaun Lindsay and Sheng Feng and Shenghao Lin and Shengxin Cindy Zha and Shiva Shankar and Shuqiang Zhang and Sinong Wang and Sneha Agarwal and Soji Sajuyigbe and Soumith Chintala and Stephanie Max and Stephen Chen and Steve Kehoe and Steve Satterfield and Sudarshan Govindaprasad and Sumit Gupta and Sungmin Cho and Sunny Virk and Suraj Subramanian and Sy Choudhury and Sydney Goldman and Tal Remez and Tamar Glaser and Tamara Best and Thilo Kohler and Thomas Robinson and Tianhe Li and Tianjun Zhang and Tim Matthews and Timothy Chou and Tzook Shaked and Varun Vontimitta and Victoria Ajayi and Victoria Montanez and Vijai Mohan and Vinay Satish Kumar and Vishal Mangla and Vítor Albiero and Vlad Ionescu and Vlad Poenaru and Vlad Tiberiu Mihailescu and Vladimir Ivanov and Wei Li and Wenchen Wang and Wenwen Jiang and Wes Bouaziz
and Will Constable and Xiaocheng Tang and Xiaofang Wang and Xiaojian Wu and Xiaolan Wang and Xide Xia and Xilun Wu and Xinbo Gao and Yanjun Chen and Ye Hu and Ye Jia and Ye Qi and Yenda Li and Yilin Zhang and Ying Zhang and Yossi Adi and Youngjin Nam and Yu Wang and Yuchen Hao and Yundi Qian and Yuzi He and Zach Rait and Zachary DeVito and Zef Rosnbrick and Zhaoduo Wen and Zhenyu Yang and Zhiwei Zhao}, year={2024}, eprint={2407.21783}, archiveprefix={arXiv}, primaryclass={cs.AI}, url={https://arxiv.org/abs/2407.21783} }@inproceedings{pmlr-v139-radford21a, title={Learning Transferable Visual Models From Natural Language Supervision}, author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya}, booktitle={Proceedings of the 38th International Conference on Machine Learning}, pages={8748--8763}, year={2021}, editor={Meila, Marina and Zhang, Tong}, volume={139}, series={Proceedings of Machine Learning Research}, month={18--24 Jul}, publisher={PMLR}, pdf={http://proceedings.mlr.press/v139/radford21a/radford21a.pdf}, url={https://proceedings.mlr.press/v139/radford21a.html}, abstract={State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.} }@inproceedings{zha-etal-2023-alignscore, title={AlignScore: Evaluating Factual Consistency with A Unified Alignment Function}, author={Zha, Yuheng and Yang, Yichi and Li, Ruichen and Hu, Zhiting}, editor={Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki}, booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year={2023}, address={Toronto, Canada}, publisher={Association for Computational Linguistics}, url={https://aclanthology.org/2023.acl-long.634}, doi={10.18653/v1/2023.acl-long.634}, pages={11328--11348}, abstract={Many text generation applications require the generated text to be factually consistent with input information. Automatic evaluation of factual consistency is challenging. Previous work has developed various metrics that often depend on specific functions, such as natural language inference (NLI) or question answering (QA), trained on limited data. Those metrics thus can hardly assess diverse factual inconsistencies (e.g., contradictions, hallucinations) that occur in varying inputs/outputs (e.g., sentences, documents) from different tasks. 
In this paper, we propose AlignScore, a new holistic metric that applies to a variety of factual inconsistency scenarios as above. AlignScore is based on a general function of information alignment between two arbitrary text pieces. Crucially, we develop a unified training framework of the alignment function by integrating a large diversity of data sources, resulting in 4.7M training examples from 7 well-established tasks (NLI, QA, paraphrasing, fact verification, information retrieval, semantic similarity, and summarization). We conduct extensive experiments on large-scale benchmarks including 22 evaluation datasets, where 19 of the datasets were never seen in the alignment training. AlignScore achieves substantial improvement over a wide range of previous metrics. Moreover, AlignScore (355M parameters) matches or even outperforms metrics based on ChatGPT and GPT-4 that are orders of magnitude larger.} }@article{li2023otter, title={Otter: A Multi-Modal Model with In-Context Instruction Tuning}, author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei}, journal={arXiv preprint arXiv:2305.03726}, year={2023} }