Addressing the Data Deluge

From the Fall 2012 special issue of Columbia Engineering Magazine
As computational technology has advanced, so has the abundance of data being generated, collected, and stored in systems around the globe. Along with these technological changes, society is undergoing a dramatic transition from a “data poor” to a “data rich” environment in both scientific and business applications. The abundance, complexity, and variety of the data being produced are challenging scientists and industry alike. This so-called data deluge arises from the growth of online resources, and monitored online consumer behavior provides new sources of data as well. While the data deluge continues to raise concerns about personal privacy, the intelligent use and mining of data offer possibilities to create value, drive growth, transform decision making, and develop solutions to problems of societal concern worldwide.
Columbia’s new Institute for Data Sciences and Engineering will zero in on just that—big data and its full potential. Funded in part through an award from New York City’s Economic Development Corporation (NYCEDC), the Institute will enable engineering and applied science researchers to obtain the education, resources, and collaborations necessary to translate a data-rich environment into informational discoveries that offer tremendous potential in innovation, commercial enterprise, and workforce development. This new Institute will comprise six core centers of study and entrepreneurship—focused on New Media, Smart Cities, Health Analytics, Cybersecurity, Financial Analytics, and Foundations for Data Science.
A 2011 McKinsey report conveys the massive scale of big data in its assessment of the economy. Among the leading indicators are five billion mobile phones in use in 2010, and 15 out of 17 major sectors in the United States having more data stored per company than the U.S. Library of Congress. The report also states that more than 30 million networked sensor nodes are now present in the transportation, automotive, industrial, utilities, and retail sectors, while over half a billion people worldwide are using smart phones. McKinsey’s report “expects big data to rapidly become a key determinant of competition across sectors,” noting that this is exactly where the workforce will experience a gap in the coming years, with a projected shortfall of “140,000 to 190,000 positions” by 2018. It is exactly such a diversity of challenges and opportunities that our Institute targets.
The late 1990s featured Silicon Valley Internet start-ups, Boston biotech start-ups, and Washington, D.C., “beltway bandits” to support defense and intelligence agency needs. These companies grew organically from the needs, talent, and culture of the local environment. We envision a similar organic growth of start-ups in New York City, addressing the needs and interests of our environment.
In New York, the media capital of the world, companies struggle to shift to a new digital paradigm, advertising and marketing turn online, and youth turns to new forms of social media. These changes set the stage for a focus on innovation in our New Media Center. In the New Media Section of this special issue of the magazine, Professor Shih-Fu Chang of the Department of Electrical Engineering highlights cutting-edge research at Columbia in the analysis and creation of visual media, such as video, while Professor Michael Collins of Computer Science discusses advances in the analysis and creation of online language, as is carried out, for example, in machine translation. These faculty members highlight their own work as well as that of the many other faculty within the Engineering School who work on the analysis of a wide variety of media, including text, speech, image, video, and social media.
New York also faces challenges posed by an aging infrastructure, a need to improve its energy efficiency, and the potential to use data-enabled technology to help its concentrated population live more efficiently. These and other issues set the stage for a focus on data-enabled innovation in our Smart Cities Center. In the section Smart Cities, Professor Raimondo Betti of the Department of Civil Engineering and Engineering Mechanics describes research that uses advanced sensing to monitor the health of infrastructure in New York and other cities, including vital civil infrastructure such as bridges, while Professor Vijay Modi of the Department of Mechanical Engineering writes about the role of big data in increasing urban energy efficiency. Professor Kartik Chandran writes about technology that can aid in clean water supplies. Faculty from Electrical Engineering, Earth and Environmental Engineering, Civil Engineering, Computer Science, and Mechanical Engineering, as well as researchers from the Center for Computational Learning Systems, address a wide range of problems in this area.
With a diverse population in need of health care and preventative medical interventions, national health care costs have skyrocketed. New York City’s hospitals, including those of the Columbia University Medical Center, are the most advanced in the world in their use of online patient data. Growing demand for effective health care, combined with our local talent base in this area, sets the stage for a focus on innovation in our new Health Analytics Center. In the section Health Analytics, Professor Andrew Laine of the Department of Biomedical Engineering describes research on the analysis of large data sets resulting from medical imaging, which helps to improve patient care, while Professor Chris Wiggins of Applied Physics and Applied Mathematics writes about the need for interdisciplinary research on the human genome, an endeavor that promises to provide new understanding of diseases that were previously difficult to prevent and treat.
Faculty from the Morningside campus often collaborate with faculty from the Columbia University Medical campus, where biomedical informatics researchers work with clinical data to improve health care, and Professor Andrea Califano’s group works on problems in systems biology involving large genomic datasets.
With the almost immeasurable reams of data generated every minute of every day, worldwide, comes a commensurate need to keep data secure and private for its lifetime, for both the institutions and the individuals that rely on and generate that information. Greater research, technology, and business development in the sphere of security is critical, and we are forming a new Cybersecurity Center as a key part of the new Institute for Data Sciences and Engineering. In the section on Cybersecurity, Professor Angelos Keromytis of the Department of Computer Science explains the new security challenges that arise when computation takes place in the “cloud,” a new way of supporting large data applications that is rapidly being embraced by data users around the globe. Other faculty within the Computer Science Department work on problems ranging from cryptographic theory to policy to algorithms that ensure secure systems.
And finally, as New York is the finance capital of the world, it demands technology experts and innovative new approaches to data comprehension, capture, curation, and management—with tremendous opportunities for both entrepreneurship and workforce development. The demands of the finance sector also require particular expertise and drain talent from other industries and sectors. As such, our new Financial Analytics Center will cultivate a larger talent pool and workforce, as well as the technology and applications necessary to further advance this critical sector of the New York City business community. In the section on Finance, Professor Garud Iyengar of the Department of Industrial Engineering and Operations Research (IEOR) discusses new methods for analyzing data that can help with financial risk management, an important approach to avoid the problems we have seen in the financial industry in the last few years. Other faculty within IEOR and the Computer Science Department work on related problems, as do faculty within the Columbia Business School.
To support and amplify the work of five Institute centers, which all lie at the heart of New York City’s innovation economy, the Institute also will conduct core research on problems that cut across the data sciences and engineering. The research will focus on formal and mathematical models for data processing, as well as on issues concerning the engineering of large-scale data collection, aggregation, transmission, and processing systems. In the section on core research, Professors Keren Bergman and Gil Zussman of the Department of Electrical Engineering discuss Columbia University’s historic and ongoing contributions to the development of the Internet, highlighting new interdisciplinary research in the field of intelligent optical devices that has potential to completely transform network services of the future. Core research within the Institute will also focus on problems in machine learning and data analytics, collaborating with faculty across all centers to apply new techniques to problems they are addressing.
A key focus of the Data Sciences Institute will be on translational research and interaction with industry. As Brynjolfsson, Hitt, and Kim (2011) discovered in their survey of 179 large corporations, those companies that have adopted a “data-driven” decision-making process had 5 to 6 percent greater productivity than companies that followed a more traditional “intuition and experience” approach. Borrowing from the medical field’s translational paradigm “from bench to bedside,” the new Institute will address the continuum from “data to innovation” through a program that spans from basic scientific research through to solutions and technology transfer. New educational models and products will be built to attract and train a diverse cadre of students with the talents to exploit the value of a data-rich society.
The Institute for Data Sciences and Engineering will be led by The Fu Foundation School of Engineering and Applied Science in close collaboration with seven other schools within the University: Columbia Business School, the Graduate School of Arts and Sciences, the Mailman School of Public Health, the College of Physicians and Surgeons, Columbia Journalism School, the School of International and Public Affairs, and the Graduate School of Architecture, Planning and Preservation. Through interaction with a coalition of industry and community partners and startups and the NYCEDC, the Institute for Data Sciences and Engineering will form an innovation hub that can help harness the power of our data-rich society through novel research and enterprises that have local, national, and global impact.
About the authors
Institute Associate Director Patricia J. Culligan, professor of civil engineering, is a leader in the field of water resources and urban sustainability. She has worked extensively with The Earth Institute’s Urban Design Lab at Columbia University to explore novel, interdisciplinary solutions to the modern-day challenges of urbanization, with a particular emphasis on the City of New York.
Kathleen R. McKeown is the inaugural director of Columbia’s Institute for Data Sciences and Engineering and also is the Henry and Gertrude Rothschild Professor of Computer Science at the Engineering School. A leading scholar and researcher in the field of natural language processing, McKeown focuses her research on big data; her interests include text summarization, question answering, natural language generation, multimedia explanation, digital libraries, and multilingual applications.



