
Living in Big Data with Vector Functional Programming • Dave Thomas • YOW! 2014


Summary

This summary covers a talk that presents Vector Functional Programming as a powerful and efficient approach to data analysis, particularly for large and messy datasets. The speaker argues for its advantages over traditional “Big Data” tools and object-oriented programming, especially for “thinkers”: domain experts who need interactive data exploration without becoming software engineers.

Here’s a breakdown of the key points:

  • Motivation and Rebranding: The speaker rebrands array programming as “Vector Functional Programming” to make it more approachable. Frustrated with the term “Big Data”, the speaker prefers to talk about “copious data”. The aim is to explain why array programming is a good choice rather than just show impressive demos.
  • The Problem: Many organizations have powerful batch processing tools (like Hadoop) but lack interactive tools for analysts to “think in their data.” The target users are domain experts (finance, biology, law) who are comfortable with scripting but not professional programmers. They need to interactively analyze terabytes of messy, real-world data.
  • Characteristics of Real Data: Real data is messy, highly variable, often time-series based, and contains missing and out-of-band values. It also has uncertainty. Traditional languages and databases often struggle with these aspects.
  • Data Languages vs. Languages with Databases: The speaker differentiates between languages designed to work directly with data (data languages) and languages that interact with separate databases. They criticize the performance overhead of moving data between databases and client-side processing, even with columnar stores.
  • Top-Down vs. Bottom-Up Language Design: Haskell (top-down - elegant abstractions, pure functions) is contrasted with array programming (bottom-up - engineered for speed, rectangles). While Haskell is elegant, the speaker argues that for performance, especially with hardware limitations, a bottom-up “engineering” approach focused on rectangular data structures is more effective.
  • Array Virtual Machines: Array VMs are presented as faster and simpler than object VMs. They naturally align with column stores, eliminating impedance mismatches. Serialization is trivial, and data parallelism is easily implemented.
  • Why Vector Programming over Lazy Functional VMs? The choice was made partly out of ignorance at the time (about six years before the talk) and partly due to the need for an interactive environment on a commercial platform. Fixed types were chosen to reduce abstraction overhead and code bloat. Refactoring tools for large codebases are deemed lacking even in elegant functional languages.
  • Hardware Affinity: Computers work best with rectangular data structures. Vector programming aligns with this, leading to performance gains.
  • History of Vector Languages (APL Family): The talk traces the lineage from APL (originally a hardware description language that “escaped”), through APL2 and APL+, to more modern languages like J, K, and Q. Key figures like Ken Iverson, Trenchard More, Arthur Whitney, and Roger Hui are mentioned.
  • APL Examples & Expressiveness: Examples like the Game of Life and Sudoku solvers in APL are used to demonstrate the conciseness and power of these languages. The concept of “no stinking loops” and the focus on character count for program length are highlighted.
  • Vector Programming Style: It’s applicative, expression-oriented, and mostly functional (despite some imperative features). Programmers are encouraged to start with smaller expressions and then compose them.
  • Learning Vector Programming: It’s likened to learning Scheme or Lisp - requiring practice and immersion in idioms. The initial focus should be on collection programming and mastering nouns, verbs, and adverbs of the language.
  • Q and K: Q is presented as a DSL for finance built on K, a more hardcore vector language. Arthur Whitney’s philosophy of prioritizing innovation over backward compatibility in K is discussed. Q offers a more approachable syntax than K.
  • Examples of Q’s Power: A Sudoku solver in Q is compared to K, showing the increased verbosity of Q for readability. A complete text editor in K in only 78 characters demonstrates extreme conciseness. The simplicity of KSQL (SQL dialect within K) is highlighted.
  • Performance and Productivity: Vector programming is described as highly productive and fast. Word count examples comparing Scalding, Hive, and Q code showcase Q’s efficiency and conciseness.
  • Challenges and Future Directions: Documentation, error messages, and type systems are areas for improvement. However, the ease of use for domain experts is emphasized. The talk concludes by highlighting the potential of vector functional programming to move beyond niche applications and become more widely adopted, especially with advancements inspired by functional programming concepts.
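
The points above about array VMs, column stores, and trivial serialization can be made concrete with a small sketch. It is written in Python rather than Q or K, uses only the standard-library array module, and the trade data and the helper names (notional, save_column, load_column) are hypothetical illustrations, not anything shown in the talk. It assumes a column is simply one typed, contiguous vector per field.

```python
from array import array
from operator import mul

# Hypothetical trade data stored column-wise: one typed vector per field.
prices = array("d", [10.0, 10.5, 9.75, 11.25])  # 64-bit floats
sizes = array("q", [100, 200, 150, 50])         # 64-bit signed integers


def notional(price_col, size_col):
    """Whole-column operation: price * size for every row, without index bookkeeping."""
    return array("d", map(mul, price_col, size_col))


def save_column(col):
    """Serialize a column as its type code plus the raw bytes of its buffer."""
    return col.typecode, col.tobytes()


def load_column(typecode, raw):
    """Rebuild a column from the pair produced by save_column."""
    col = array(typecode)
    col.frombytes(raw)
    return col


values = notional(prices, sizes)
restored = load_column(*save_column(values))
print(list(values), restored == values)
```

Because each column is a flat buffer of one fixed type, saving it amounts to copying bytes, which is the sense in which serialization is described as trivial. A real array VM would run the multiplication as a single vectorized primitive rather than through Python's map.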

Accuracy

The claims made in the talk are generally consistent with established knowledge about programming languages and data processing paradigms. In more detail:

  • Historical Claims about APL, J, K, and their creators: The historical accounts of Ken Iverson, Arthur Whitney, Roger Hui, and the development of APL, J, and K are largely accurate. APL was initially conceived as a mathematical notation and evolved into a programming language; the notation was also used to write a formal description of the IBM System/360, which may be the basis for the “hardware description language” remark. Ken Iverson received the 1979 Turing Award for his work on programming languages and mathematical notation that resulted in APL. Arthur Whitney created the K language and contributed to the earliest work on the J interpreter. The characterization of J as CISC and K as RISC in terms of complexity and verbosity is a common analogy used within the community.
  • Performance of Array VMs vs. Object VMs: The claim that array VMs can be faster and simpler than object VMs, especially for numerical and data-parallel tasks, is generally accepted. Array-oriented operations often map well to hardware, and the reduced overhead of object management in array VMs contributes to performance.
  • Expressiveness and Conciseness of Array Languages: The examples of Game of Life and Sudoku in APL (and by extension, J and K) accurately represent the potential for extreme conciseness in these languages. The “one-liner” culture and focus on character count are real aspects of the APL community.
  • Challenges of Refactoring in Functional Languages: While functional languages are often praised for refactoring due to immutability and pure functions, the speaker’s point about the lack of mature, widely-used refactoring tools for large codebases, even in languages like Haskell and Clojure, has some validity. Refactoring tools are generally less sophisticated compared to those available for languages like Java or C#.
  • Limitations of Java regarding closures, tail recursion, serialization: The statements about Java’s historical (and to some extent, ongoing) weaknesses in areas like closures, tail recursion, and efficient serialization are also accurate. Java 8 added lambda expressions in 2014, but the JVM still does not perform tail-call optimization, and Java’s built-in serialization has known performance and security issues.
  • Efficiency of Vector Operations and Data Parallelism: The assertion that vector operations are inherently efficient and lend themselves well to data parallelism is a core principle of array programming and is supported by both theoretical and practical evidence. Array operations can be highly optimized and parallelized, leading to significant performance gains.
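
The whole-array style behind the Game of Life example can be approximated as follows. This is a minimal sketch in plain Python, not APL, J, or K, and not the code shown in the talk: it assumes a grid that wraps at the edges (a torus), and the function names rotate, neighbours, and step are made up for illustration. The idea it borrows from the array languages is to shift the entire grid in eight directions and add the shifted copies, instead of visiting each cell and probing its neighbours.

```python
def rotate(grid, dr, dc):
    """Rotate a 2-D grid by dr rows and dc columns, wrapping at the edges."""
    rows = grid[-dr:] + grid[:-dr]
    return [row[-dc:] + row[:-dc] for row in rows]


def neighbours(grid):
    """Count live neighbours for every cell by summing eight shifted copies of the grid."""
    shifted = [
        rotate(grid, dr, dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    return [[sum(cells) for cells in zip(*rows)] for rows in zip(*shifted)]


def step(grid):
    """One generation: a cell is alive with 3 neighbours, or with 2 if it was already alive."""
    counts = neighbours(grid)
    return [
        [int(n == 3 or (alive == 1 and n == 2)) for alive, n in zip(row, count_row)]
        for row, count_row in zip(grid, counts)
    ]


# A horizontal "blinker" on a 5x5 grid; it should flip to vertical and back.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in step(blinker):
    print(row)
```

Each of the eight shifts is independent of the others, and each element-wise sum touches every cell in the same way, which is why this formulation lends itself to data parallelism. In an array language the rotations and sums are built-in primitives, so the same logic fits on one line.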

Minor Nuances and Potential Overstatements:

  • “No serious refactoring tools for any decent amount of code”: This is a strong statement and might be an overgeneralization. While refactoring tools in functional languages may not be as feature-rich as in some other paradigms, tools and techniques for refactoring do exist and are used. The degree of “seriousness” is subjective.
  • “Hardware doesn’t do lazy things all that well”: While there’s truth in the idea that eager evaluation can sometimes be more hardware-friendly, especially for certain types of operations, modern hardware and compiler optimizations can handle lazy evaluation quite effectively. The performance difference isn’t always as stark as implied, and lazy evaluation offers other advantages (like composability and handling infinite data structures).
  • “C under the misbelief that you actually know what C is doing as a programming language today”: This is a humorous jab at the complexity of modern C and the potential for subtle bugs. While C can be complex, it remains a widely used and well-understood language for systems programming and performance-critical code.
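
The trade-off between lazy and eager evaluation mentioned above can be illustrated with a small sketch. It uses Python generators as a stand-in for a lazy functional language, which is an assumption made for illustration only; the function names are hypothetical. The lazy version composes over an unbounded stream, while the eager version materializes a fixed-size vector up front, which is closer to how an array language would process the data.

```python
from itertools import count, islice


def squares_lazy():
    """Unbounded stream of squares; nothing is computed until a consumer asks."""
    return (n * n for n in count(1))


def squares_eager(limit):
    """Eager counterpart: materializes the squares of 1..limit as a list."""
    return [n * n for n in range(1, limit + 1)]


# Lazy: take the first three even squares without choosing a bound in advance.
first_even_lazy = list(islice((s for s in squares_lazy() if s % 2 == 0), 3))

# Eager: the caller must pick a bound large enough to contain three even squares.
first_even_eager = [s for s in squares_eager(6) if s % 2 == 0]

print(first_even_lazy, first_even_eager)
```

The lazy pipeline never needs to know how much input it will consume, which is the composability advantage. The eager version works on a contiguous block of known size, which is the shape that hardware and array VMs tend to handle most efficiently.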

Overall Accuracy: The transcript provides a reasonably accurate and informative overview of vector functional programming, its history, advantages, and challenges. The speaker’s enthusiasm for the paradigm is evident, and while some points might be slightly exaggerated for rhetorical effect, the core information is consistent with established knowledge.

Resources

Five resources for learning more about the subjects covered in the talk:

  1. Jsoftware Website (jsoftware.com): J is a modern descendant of APL and a powerful array programming language. The Jsoftware website is the official resource, offering:

    • Free Downloads: You can download and use J for free.
    • Documentation and Tutorials: Extensive documentation, tutorials, and examples to get started with J.
    • J Wiki: A community-driven wiki with a wealth of information and resources.
    • Forums and Community: Active forums where you can ask questions and interact with other J programmers.
    • Why it’s relevant: J is mentioned as a key language in the talk. Learning J directly allows you to explore the concepts of vector programming and array-based thinking discussed in the transcript.
  2. KX.com (kx.com): KX is the company behind kdb+ and Q, the language prominently featured in the talk as used in finance. This website provides:

    • kdb+ and Q Information: Details about kdb+, a high-performance time-series database and its query language Q.
    • Documentation: Documentation for Q and kdb+.
    • Free Trial: KX offers a free personal edition of kdb+ for non-commercial use, allowing you to experiment with Q.
    • Community and Support: Access to community forums and professional support options.
    • Why it’s relevant: Q and kdb+ are central to the talk’s arguments about practical applications in finance and data analysis. Exploring kdb+ and Q provides hands-on experience with the specific “Vector Functional Programming” dialect discussed.
  3. “Mastering Dyalog APL” by Bernard Legrand: (Available online and in print - search for it). Dyalog APL is a commercial APL implementation but offers a free personal edition. This book is highly recommended for learning APL and the array programming paradigm in depth.

    • Comprehensive APL Learning Resource: Covers APL from basic concepts to advanced techniques.
    • Focus on Array Thinking: Teaches you how to think in terms of arrays and vector operations.
    • Practical Examples: Includes numerous examples and exercises to reinforce learning.
    • Why it’s relevant: APL is the ancestor of J and K. Understanding APL provides a strong foundation for understanding the principles behind vector functional programming and the languages discussed.
  4. “Concrete Mathematics: A Foundation for Computer Science” by Graham, Knuth, and Patashnik: (Available widely online and in print). While not directly about array programming, this book is highly relevant to the mindset and mathematical foundations that underpin efficient algorithm design and the kind of thinking prevalent in vector languages.

    • Mathematical Tools for Computer Science: Covers essential mathematical tools like summations, recurrences, generating functions, and asymptotic analysis.
    • Emphasis on Problem Solving: Develops problem-solving skills crucial for efficient programming.
    • Connects Math and Computation: Bridges the gap between mathematical thinking and computational efficiency.
    • Why it’s relevant: Vector languages often encourage a mathematical, formulaic approach to problem-solving. This book helps develop the mathematical maturity that can enhance your understanding and effectiveness with such languages.
  5. “Thinking Like a Vector” Blog Series by Bob Therriault (ArrayCast Podcast website): (Search for “ArrayCast Podcast” and look for blog posts). This blog series (and the ArrayCast podcast itself) directly addresses the concepts and mindset of array programming in an accessible and engaging way.

    • Focus on Array Programming Mindset: Explores the unique way of thinking required for effective array programming.
    • Practical Advice and Examples: Provides practical tips and examples to help you “think like a vector.”
    • Covers Various Array Languages: Often discusses concepts relevant across different array languages (APL, J, K, etc.).
    • Community Engagement: Part of a broader community focused on array programming.
    • Why it’s relevant: This resource directly tackles the challenge of learning to “think in rectangles” as mentioned in the transcript, providing guidance on adopting the vector programming mindset.

Together these resources combine practical tools (J, kdb+), in-depth learning material (the APL book), foundational knowledge (Concrete Mathematics), and mindset development (the Thinking Like a Vector blog), giving a starting point for anyone who wants to explore Vector Functional Programming further.
